
CN113505719A - Gait recognition model compression system and method based on local-global joint knowledge distillation algorithm - Google Patents


Info

Publication number: CN113505719A
Application number: CN202110824459.3A
Authority: CN (China)
Prior art keywords: layer, convolution, network, model, pooling
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN113505719B (en)
Inventors: 单彩峰 (Shan Caifeng), 宋旭 (Song Xu), 陈宇 (Chen Yu), 黄岩 (Huang Yan)
Current and Original Assignee: Shandong University of Science and Technology
Application filed by Shandong University of Science and Technology

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a gait recognition model compression system and method based on a local-global joint knowledge distillation algorithm. The system uses depthwise separable convolution to design a compact, lightweight gait model network: the convolutional network of this model retains only a backbone convolutional network, and every convolution module in the backbone is simplified. The model also adopts a 16-layer lightweight fully-connected network, so the model parameters are greatly compressed, the computation of the recognition model is simplified, and recognition efficiency improves. The method performs joint knowledge distillation using both the local feature vectors output by the convolutional networks of the teacher and student models and the global feature vectors output by their fully-connected networks: the convolution operations retain the local features of pedestrian gait, while the fully-connected operations extract its global features, which increases the information content of the knowledge distillation and improves pedestrian gait recognition.

Description

Gait recognition model compression system and method based on local-global joint knowledge distillation algorithm
Technical Field
The invention belongs to the technical field of image recognition, and particularly relates to a gait recognition model compression system and method based on a local-global joint knowledge distillation algorithm.
Background
Gait recognition is an emerging biometric technology that aims to find and extract the characteristic differences between pedestrians from a series of walking postures, so as to identify pedestrians automatically. Compared with other biometric technologies, gait recognition has many advantages: it does not require active cooperation from the subject, has low requirements on image resolution, is not restricted to particular viewing angles, is hard to disguise, and works at long distances. Because of this, gait recognition is very widely applicable in fields such as video surveillance and intelligent security.
At present, gait recognition technology is mostly designed based on a standard convolutional neural network, and a recognition model is trained by collecting gait video samples containing pedestrian labels, so that the model learns useful gait appearance and motion characteristics from the samples and can recognize according to the characteristics. Existing gait recognition techniques can be divided into model-based methods and appearance-based methods depending on whether or not a human body model is established.
Model-based methods extract gait features by building a human skeleton or pose structure; they are computationally expensive and lead to complicated network structures. Appearance-based methods, currently the most common, extract gait features directly from raw video captured by a camera, and can be subdivided into feature-template-based, gait-video-based and set-based methods.
Feature-template-based methods (e.g., the gait energy image) recognize by extracting gait features from the template; they are simple to implement but tend to lose temporal information. Gait-video-based methods extract spatio-temporal gait features for recognition through a three-dimensional standard convolutional neural network, but the model is large and hard to train. Set-based methods extract single-frame gait silhouette features with a two-dimensional standard convolutional neural network and aggregate spatio-temporal gait features with a set pooling structure, achieving efficient recognition, but the model scale is still large.
In summary, conventional gait recognition methods mainly rely on high-capacity neural network models; they suffer from many parameters and long training times, are difficult to deploy and popularize, and are unsuitable for applications with strict real-time requirements.
Traditional model compression can reduce model capacity and parameters to some extent, but such methods are too crude to preserve the key information in the model, so recognition performance degrades severely; they are therefore unsuitable for compressing gait recognition models.
Disclosure of Invention
The invention aims to provide a gait recognition model compression method based on a local-global joint knowledge distillation algorithm, which effectively preserves the model's gait recognition accuracy while reducing the scale of the model parameters.
In order to achieve the purpose, the invention adopts the following technical scheme:
The gait recognition model compression system based on the local-global joint knowledge distillation algorithm comprises a teacher model Mt and a student model Ms, wherein:
the teacher model Mt consists of a convolutional network, set pooling structures, horizontal pyramid pooling structures and a fully-connected network;
the convolution network consists of a backbone network and a plurality of layers of global channels;
the backbone network consists of a first convolution module, a second convolution module and a third convolution module;
the first convolution module is comprised of three layers, wherein:
the first layer is a standard convolutional layer using a 5 × 5 convolution kernel; the second layer is a standard convolutional layer using a 3 × 3 convolution kernel; the third layer is a max pooling layer with a 2 × 2 pooling kernel and a stride of 2;
the second convolution module is comprised of three layers, wherein:
the first layer is a standard convolutional layer using a 3 × 3 convolution kernel; the second layer is a standard convolutional layer using a 3 × 3 convolution kernel; the third layer is a max pooling layer with a 2 × 2 pooling kernel and a stride of 2;
the third convolution module consists of two layers, wherein:
the first layer is a standard convolutional layer using a 3 × 3 convolution kernel; the second layer is a standard convolutional layer using a 3 × 3 convolution kernel;
the multi-layer global channel consists of a fourth convolution module and a fifth convolution module;
the fourth convolution module consists of three layers, wherein: the first layer is a standard convolutional layer using a 3 × 3 convolution kernel; the second layer is a standard convolutional layer using a 3 × 3 convolution kernel; the third layer is a pooling layer with a 2 × 2 pooling kernel;
the fifth convolution module consists of two layers, wherein:
the first layer is a standard convolutional layer using a 3 × 3 convolution kernel; the second layer is a standard convolutional layer using a 3 × 3 convolution kernel;
four set pooling structures are provided in the teacher model Mt, along with two horizontal pyramid pooling structures and one fully-connected network;
the four set pooling structures are defined as the first, second, third and fourth set pooling structures respectively; the two horizontal pyramid pooling structures are defined as the first and second horizontal pyramid pooling structures respectively;
the fully connected network comprises a first fully connected sub-network and a second fully connected sub-network;
the output of the first convolution module is connected with the input of the second convolution module;
the output of the first convolution module is also connected with the input of the fourth convolution module through the first set pooling structure;
the output of the second convolution module is connected with the input of the third convolution module;
the output of the second convolution module is connected with the input of the second set pooling structure, and the output of the second set pooling structure is added with the output of the fourth convolution module at the corresponding position and then connected with the input of the fifth convolution module;
the output of the third convolution module is connected with the input of the third collection pooling structure, and the output of the third collection pooling structure is added with the output of the fifth convolution module at corresponding positions and then connected with the input of the first horizontal pyramid pooling structure;
the output of the third convolution module is also connected with the input of the second horizontal pyramid pooling structure through a fourth set pooling structure;
the output of the first horizontal pyramid pooling structure is connected to the input of the first fully-connected sub-network;
the output of the second horizontal pyramid pooling structure is connected to the input of a second fully-connected sub-network;
the outputs of the first and second fully-connected sub-networks serve as the output of the teacher model Mt;
the first horizontal pyramid pooling structure and the second horizontal pyramid pooling structure have five scales;
the first fully-connected sub-network and the second fully-connected sub-network respectively comprise 31 independent fully-connected neural network layers;
the student model Ms consists of a simplified convolutional network, a set pooling structure, a simplified horizontal pyramid pooling structure and a simplified fully-connected network, wherein:
the simplified convolution network consists of a sixth convolution module, a seventh convolution module and an eighth convolution module;
the sixth convolution module consists of two layers, wherein: the first layer is a standard convolutional layer using a 5 × 5 convolution kernel; the second layer is a max pooling layer with a 2 × 2 pooling kernel and a stride of 2;
the seventh convolution module consists of three layers, wherein:
the first layer is a depthwise convolutional layer using a 5 × 5 convolution kernel; the second layer is a pointwise convolutional layer using a 1 × 1 convolution kernel; the third layer is a max pooling layer with a 2 × 2 pooling kernel and a stride of 2;
the eighth convolution module is comprised of two layers, wherein:
the first layer is a depthwise convolutional layer using a 3 × 3 convolution kernel; the second layer is a pointwise convolutional layer using a 1 × 1 convolution kernel;
the sixth convolution module, the seventh convolution module and the eighth convolution module are connected in sequence;
the set pooling structure in the student model Ms is defined as the fifth set pooling structure;
the output of the eighth convolution module is connected with the input of the reduced horizontal pyramid pooling structure through a fifth set pooling structure;
the output of the simplified horizontal pyramid pooling structure is connected with the input of the simplified fully-connected network;
the simplified horizontal pyramid pooling structure has only one scale; the simplified fully-connected network comprises 16 independent fully-connected neural network layers; the output of the simplified fully-connected network serves as the output of the student model Ms.
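For concreteness, a minimal PyTorch sketch of the student model Ms as described above follows. The class names, activation functions, padding and bias settings are illustrative assumptions not specified in the text; the module structure (standard 5 × 5 conv, then depthwise separable convolutions with max pooling) follows the description.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class StudentBackbone(nn.Module):
    """Simplified convolutional network: sixth, seventh and eighth modules."""
    def __init__(self):
        super().__init__()
        self.module6 = nn.Sequential(           # 5x5 standard conv + 2x2 max pool
            nn.Conv2d(1, 32, 5, padding=2), nn.LeakyReLU(inplace=True),
            nn.MaxPool2d(2, stride=2))
        self.module7 = nn.Sequential(           # 5x5 depthwise + 1x1 pointwise + pool
            DepthwiseSeparableConv(32, 64, 5), nn.LeakyReLU(inplace=True),
            nn.MaxPool2d(2, stride=2))
        self.module8 = nn.Sequential(           # 3x3 depthwise + 1x1 pointwise
            DepthwiseSeparableConv(64, 128, 3), nn.LeakyReLU(inplace=True))

    def forward(self, x):                       # x: (s, 1, 64, 44) silhouettes
        return self.module8(self.module7(self.module6(x)))

print(StudentBackbone()(torch.randn(30, 1, 64, 44)).shape)  # (30, 128, 16, 11)
```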
In addition, the invention also provides a gait recognition model compression method based on the local-global joint knowledge distillation algorithm, built on the gait recognition model compression system described above. The specific technical scheme is as follows.
The gait recognition model compression method based on the local-global joint knowledge distillation algorithm comprises the following steps:
Step 1. Extract the gait silhouette sequences in the gait videos using background subtraction, uniformly crop them into picture sets of the same size to form a data set X, and divide the data set X into a training set Xtrain and a test set Xtest.
Step 2. Train the teacher model Mt using the training set Xtrain, setting the learning rate and the number of iterations; the optimizer is the Adam optimizer, and the loss function is the triplet loss function Ltri shown in formula (1); a code sketch of this loss follows the definitions below:

$$L_{tri}=\frac{1}{N_{tri+}}\sum_{t=1}^{n}\sum_{i=1}^{P}\sum_{j=1}^{K}\sum_{\substack{a=1\\ a\neq j}}^{K}\sum_{\substack{p=1\\ p\neq i}}^{P}\sum_{k=1}^{K}\left[m+\left\|x_{i,j}^{t}-x_{i,a}^{t}\right\|_{2}-\left\|x_{i,j}^{t}-x_{p,k}^{t}\right\|_{2}\right]_{+}\qquad(1)$$

where Ntri+ represents the total number of sample pairs in a training sample subset whose Euclidean distance is not 0; a training sample subset is a set of sample images randomly selected from the training set Xtrain at each training iteration;
n represents the number of fully-connected neural network layers of the teacher network, and t indexes those layers;
P represents the number of pedestrians contained in each training sample subset; i and p index the pedestrian samples to be trained in each subset;
K represents the number of video sequences of each pedestrian in each subset; a, j and k index the pedestrian video sequences;
m represents the boundary threshold (margin) of the loss function;
x^t_{i,j} denotes, at the t-th fully-connected layer, the j-th sample to be trained of pedestrian i in the subset; x^t_{i,a} denotes any sample with the same pedestrian identity as x^t_{i,j}; x^t_{p,k} denotes any sample with a different pedestrian identity;
the symbol ||·||2 denotes the 2-norm of a matrix;
[·]+ denotes the ReLU operation, computed as [x]+ = max{0, x}, where max takes the maximum value.
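A minimal PyTorch sketch of this batch-all triplet loss follows. The tensor shapes, the helper name triplet_loss and the use of torch.cdist are implementation assumptions, not part of the patent text.

```python
import torch

def triplet_loss(features, labels, margin=0.2):
    """Batch-all triplet loss, formula (1).

    features: (n, B, d) tensor, one row of B samples per FC layer.
    labels:   (B,) pedestrian identity labels.
    """
    n, B, _ = features.shape
    dist = torch.cdist(features, features)             # (n, B, B) Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B) same-identity mask
    pos_mask = same & ~torch.eye(B, dtype=torch.bool)
    neg_mask = ~same
    # all anchor-positive / anchor-negative combinations: (n, B, B, B)
    d_ap = dist.unsqueeze(3)                           # d(anchor, positive)
    d_an = dist.unsqueeze(2)                           # d(anchor, negative)
    loss = torch.relu(margin + d_ap - d_an)
    valid = (pos_mask.unsqueeze(2) & neg_mask.unsqueeze(1)).unsqueeze(0)
    loss = loss * valid
    n_nonzero = (loss > 0).sum().clamp(min=1)          # N_tri+
    return loss.sum() / n_nonzero
```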
Step 3. Input the training set Xtrain into the trained teacher model Mt and the untrained student model Ms. For the same data, obtain the multi-dimensional feature matrix F_c^t output by the convolutional network of the teacher model Mt and the multi-dimensional feature matrix F_c^s output by the simplified convolutional network of the student model Ms, as well as the multi-dimensional feature matrix F_f^t output by the fully-connected network of the teacher model Mt and the multi-dimensional feature matrix F_f^s output by the simplified fully-connected network of the student model Ms.
The dimensions of F_c^t and F_c^s are b × s × c × h × w; the dimensions of F_f^t and F_f^s are b × n × d; where b denotes the number of samples in each training sample subset, s the number of frames, c the number of feature maps output by the convolutional layers, h and w the height and width of the convolutional output feature maps, and d the dimension of the feature matrix output by the fully-connected network.
Step 4. Use the difference metric function Lc_dis to compute the difference between the multi-dimensional feature matrices F_c^t and F_c^s, where Lc_dis is calculated as:

$$L_{c\_dis}=\left\|\frac{F_{c}^{t}}{\left\|F_{c}^{t}\right\|_{2}}-\frac{F_{c}^{s}}{\left\|F_{c}^{s}\right\|_{2}}\right\|_{F}^{2}\qquad(2)$$

where the difference metric function Lc_dis represents the local distillation loss, and ||·||F² denotes the squared F-norm (Frobenius norm) of a matrix.
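A sketch of this local distillation loss under one plausible reading of formula (2) follows: both convolutional feature matrices are L2-normalized per sample and then compared under the squared Frobenius norm. The shapes and normalization axis are assumptions.

```python
import torch
import torch.nn.functional as F

def local_distillation_loss(feat_t, feat_s):
    """L_c_dis: squared F-norm between L2-normalized conv features.

    feat_t, feat_s: (b, s, c, h, w) teacher/student conv feature matrices.
    Equal channel counts are assumed here; differing counts would need a
    projection first (see the note on channel counts in Embodiment 2).
    """
    t = F.normalize(feat_t.flatten(1), dim=1)   # L2-normalize each sample
    s = F.normalize(feat_s.flatten(1), dim=1)
    return (t - s).pow(2).sum()
```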
Step 5. Use formula (3) to compute the distances between the samples in the multi-dimensional feature matrices F_f^t and F_f^s, denoting the results d^t and d^s respectively:

$$d_{intra}^{t}=\left\|y_{i,j}^{t}-y_{i,k}^{t}\right\|_{2},\qquad d_{inter}^{t}=\left\|y_{i,j}^{t}-y_{p,k}^{t}\right\|_{2},\qquad d_{intra}^{s}=\left\|y_{i,j}^{s}-y_{i,k}^{s}\right\|_{2},\qquad d_{inter}^{s}=\left\|y_{i,j}^{s}-y_{p,k}^{s}\right\|_{2}\qquad(3)$$

where d^t_intra denotes the distances between all same-category samples in the feature matrix output by the teacher model, and d^t_inter the distances between all different-category samples; d^s_intra and d^s_inter denote the corresponding distances for the student model;
y^t_{i,j} and y^t_{i,k} denote, when training the teacher model, the j-th and k-th samples to be trained of the same category i in a training sample subset, and y^t_{p,k} the k-th sample to be trained of a different category p; y^s_{i,j}, y^s_{i,k} and y^s_{p,k} denote the corresponding samples when training the student model.
Step 6. Use the triplet loss function Ltri shown in formula (1) to compute the triplet losses L^t_tri and L^s_tri of d^t and d^s respectively, as shown in formulas (4) and (5):

$$L_{tri}^{t}=\frac{1}{N_{tri+}^{t}}\sum\left[m+d_{intra}^{t}-d_{inter}^{t}\right]_{+}\qquad(4)$$

$$L_{tri}^{s}=\frac{1}{N_{tri+}^{s}}\sum\left[m+d_{intra}^{s}-d_{inter}^{s}\right]_{+}\qquad(5)$$

where N^t_tri+ represents the total number of sample pairs in a training sample subset whose Euclidean distance is not 0 in the teacher model, and N^s_tri+ the corresponding total number in the student model.
Then compute the global distillation loss Lf_dis of L^t_tri and L^s_tri using the Smooth L1 loss function in formula (6):

$$L_{f\_dis}=\begin{cases}0.5\left(L_{tri}^{t}-L_{tri}^{s}\right)^{2}, & \left|L_{tri}^{t}-L_{tri}^{s}\right|<1\\ \left|L_{tri}^{t}-L_{tri}^{s}\right|-0.5, & \text{otherwise}\end{cases}\qquad(6)$$
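The following sketch combines formulas (4) to (6): a batch-all triplet loss is computed from the intra- and inter-class distances of each model's fully-connected features, and the gap between the teacher and student losses is measured with Smooth L1. Shapes and helper names are assumptions.

```python
import torch
import torch.nn.functional as F

def distance_triplet_loss(features, labels, margin=0.2):
    """Formulas (4)/(5): batch-all triplet loss over one model's FC features."""
    feats = features.flatten(1)                        # (B, n*d)
    dist = torch.cdist(feats, feats)                   # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    B = labels.numel()
    pos = same & ~torch.eye(B, dtype=torch.bool)       # intra-class pairs
    neg = ~same                                        # inter-class pairs
    loss = torch.relu(margin + dist.unsqueeze(2) - dist.unsqueeze(1))
    loss = loss * (pos.unsqueeze(2) & neg.unsqueeze(1))
    return loss.sum() / (loss > 0).sum().clamp(min=1)  # divide by N_tri+

def global_distillation_loss(fc_t, fc_s, labels, margin=0.2):
    """Formula (6): Smooth L1 between teacher and student triplet losses."""
    l_t = distance_triplet_loss(fc_t, labels, margin)
    l_s = distance_triplet_loss(fc_s, labels, margin)
    return F.smooth_l1_loss(l_s, l_t)
```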
Step 7. Combine the local distillation loss Lc_dis, the global distillation loss Lf_dis and the student triplet loss L^s_tri (the teacher triplet loss L^t_tri enters through Lf_dis) to obtain the total loss Ltotal:

$$L_{total}=L_{tri}^{s}+\alpha\left(L_{c\_dis}+L_{f\_dis}\right)\qquad(7)$$

where α is the distillation loss weight.
Step 8. Set the number of iterations of the student model Ms and select the Adam optimizer; by reducing the loss value Ltotal, the knowledge of the teacher model is transferred to the student model.
Step 9. Input the pedestrian video sequences of the test set Xtest into the student model Ms for recognition to obtain the recognition result.
The invention has the following advantages:
As described above, the invention provides a gait recognition model compression system based on a local-global joint knowledge distillation algorithm. A compact, lightweight gait model network (the student model) is designed around depthwise separable convolution: its convolutional network retains only the backbone convolutional network, and every convolution module in the backbone is simplified; specifically, 3 × 3 depthwise convolutional layers and 1 × 1 pointwise convolutional layers replace the standard convolutional layers of the existing scheme. In addition, the model adopts a 16-layer lightweight fully-connected network in place of the 31-layer fully-connected network of the existing scheme, greatly compressing the model parameters, simplifying the computation of the recognition model and improving recognition efficiency. On top of this compression system, the invention further provides a local-global joint knowledge distillation algorithm suited to gait recognition tasks. Compared with the prior art, the designed method performs joint knowledge distillation using both the local feature vectors output by the convolutional network and the global feature vectors output by the fully-connected network: the convolution operations retain the local features of pedestrian gait while the fully-connected operations extract its global features, increasing the information content of the knowledge distillation, improving pedestrian gait recognition and preserving the model's recognition accuracy.
Drawings
FIG. 1 is a structural diagram of the gait recognition model compression system based on the local-global joint knowledge distillation algorithm of the present invention;
FIG. 2 is a schematic diagram of the set pooling structure in an embodiment of the present invention;
FIG. 3 is a schematic diagram of the horizontal pyramid pooling structure in an embodiment of the present invention;
FIG. 4 is a schematic diagram of the simplified horizontal pyramid pooling structure in an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the multi-layer global channel in an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of the depthwise separable convolution module in an embodiment of the present invention;
FIG. 7 is a schematic flow chart of the gait recognition model compression method based on the local-global joint knowledge distillation algorithm.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
example 1
This embodiment describes a gait recognition model compression system based on the local-global joint knowledge distillation algorithm.
As shown in Fig. 1, the gait recognition model compression system constructs two recognition models based on deep neural networks, denoted the large-capacity teacher model Mt and the lightweight student model Ms.
In Fig. 1, the addition symbol denotes element-wise addition.
The teacher model Mt is composed of a convolutional network, set pooling structures, horizontal pyramid pooling structures and a fully-connected network.
The convolutional network consists of a backbone network and a plurality of layers of global channels.
The backbone network is composed of a first convolution module, a second convolution module and a third convolution module.
The first convolution module is comprised of three layers, wherein:
the first layer is a standard convolutional layer using a 5 × 5 convolution kernel; its input is s (number of frames) × 1 × 64 × 44 data and its output an s × 32 × 64 × 44 feature map;
the second layer is a standard convolutional layer using a 3 × 3 convolution kernel; its input is s × 32 × 64 × 44 data and its output an s × 32 × 64 × 44 feature map;
the third layer is a max pooling layer with a 2 × 2 pooling kernel and a stride of 2; its input is s × 32 × 64 × 44 data and its output an s × 32 × 32 × 22 feature map.
The second convolution module is comprised of three layers, wherein:
the first layer is a standard convolutional layer using a 3 × 3 convolution kernel; its input is s × 32 × 32 × 22 data and its output an s × 64 × 32 × 22 feature map;
the second layer is a standard convolutional layer using a 3 × 3 convolution kernel; its input is s × 64 × 32 × 22 data and its output an s × 64 × 32 × 22 feature map;
the third layer is a max pooling layer with a 2 × 2 pooling kernel and a stride of 2; its input is s × 64 × 32 × 22 data and its output an s × 64 × 16 × 11 feature map.
The third convolution module consists of two layers, wherein:
the first layer is a standard convolutional layer using a 3 × 3 convolution kernel; its input is s × 64 × 16 × 11 data and its output an s × 128 × 16 × 11 feature map;
the second layer is a standard convolutional layer using a 3 × 3 convolution kernel; its input is s × 128 × 16 × 11 data and its output an s × 128 × 16 × 11 feature map. A code sketch of this backbone follows.
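A PyTorch sketch of the backbone's three modules with the channel sizes above; the frame dimension s is folded into the batch dimension, and the activations and padding are assumptions.

```python
import torch
import torch.nn as nn

# Backbone of the teacher model Mt; input silhouettes are (s, 1, 64, 44).
backbone = nn.Sequential(
    # first convolution module: 5x5 conv, 3x3 conv, 2x2 max pool
    nn.Conv2d(1, 32, 5, padding=2), nn.LeakyReLU(inplace=True),
    nn.Conv2d(32, 32, 3, padding=1), nn.LeakyReLU(inplace=True),
    nn.MaxPool2d(2, stride=2),                    # -> (s, 32, 32, 22)
    # second convolution module
    nn.Conv2d(32, 64, 3, padding=1), nn.LeakyReLU(inplace=True),
    nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(inplace=True),
    nn.MaxPool2d(2, stride=2),                    # -> (s, 64, 16, 11)
    # third convolution module
    nn.Conv2d(64, 128, 3, padding=1), nn.LeakyReLU(inplace=True),
    nn.Conv2d(128, 128, 3, padding=1), nn.LeakyReLU(inplace=True),
)                                                 # -> (s, 128, 16, 11)

x = torch.randn(30, 1, 64, 44)                    # 30 frames
print(backbone(x).shape)                          # torch.Size([30, 128, 16, 11])
```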
As shown in fig. 5, the multi-layered global channel is composed of a fourth convolution module and a fifth convolution module.
The fourth convolution module consists of three layers, wherein:
the first layer is a standard convolutional layer using a 3 × 3 convolution kernel; its input is 32 × 32 × 22 data and its output a 64 × 32 × 22 feature map;
the second layer is a standard convolution layer, a 3 × 3 convolution kernel is used, the input of the second layer is data of 64 × 32 × 22, and a feature map of 64 × 32 × 22 is output;
the third layer is a pooling layer, and a 2 × 2 pooling kernel is used to input 64 × 32 × 22 data and output a 64 × 16 × 11 feature map.
The fifth convolution module consists of two layers, wherein:
the first layer is a standard convolution layer, a 3 x 3 convolution kernel is used, the input of the first layer is data of 64 x 16 x 11, and the output is a feature map of 128 x 16 x 11;
the second layer is a standard convolutional layer, a 3 × 3 convolutional kernel is used, the input of the second layer is 128 × 16 × 11 data, and the output is a 128 × 16 × 11 feature map.
There are four set pooling structures, two horizontal pyramid pooling structures and one fully-connected network in the teacher model Mt.
The four set pooling structures are defined as the first, second, third and fourth set pooling structures respectively.
One of the aggregate pooling structures is illustrated in fig. 2.
The input of a set pooling structure is the s feature matrices corresponding to s video frames, each of dimension 128 × 16 × 11; the output is a single processed feature matrix of dimension 128 × 16 × 11, in which each element is the maximum value taken across the corresponding positions of the s input matrices.
A characteristic of set pooling is that the feature matrices corresponding to the input video frames may be arranged in any order.
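A minimal sketch of set pooling as described: an element-wise maximum over the frame dimension, which is invariant to frame order. The function name is illustrative.

```python
import torch

def set_pooling(frame_feats):
    """frame_feats: (s, 128, 16, 11) per-frame feature matrices.
    Returns a single (128, 16, 11) matrix of element-wise maxima."""
    return frame_feats.max(dim=0).values

feats = torch.randn(30, 128, 16, 11)
pooled = set_pooling(feats)
# Frame order does not matter:
assert torch.equal(pooled, set_pooling(feats[torch.randperm(30)]))
```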
The two horizontal pyramid pooling structures are respectively defined as a first horizontal pyramid pooling structure and a second horizontal pyramid pooling structure, and one of the horizontal pyramid pooling structures is taken as an example, as shown in fig. 3.
One horizontal pyramid pooling input is a feature matrix of 128 × 16 × 11, the matrix is decomposed according to 5 scales to obtain intermediate feature matrices, which are respectively 1 feature matrix of 128 × 16 × 11 in dimension, 2 feature matrices of 128 × 8 × 11 in dimension, 4 feature matrices of 128 × 4 × 11 in dimension, 8 feature matrices of 128 × 2 × 11 in dimension, and 16 feature matrices of 128 × 1 × 11 in dimension, and the total number of 31 feature matrices is obtained.
The invention adopts global maximum pooling operation and global average pooling operation to compress the second dimension and the third dimension of each intermediate feature matrix into 128-dimension vectors. The above process is illustrated by way of example:
when global max pooling is applied to a 128 × 16 × 11 feature matrix, the matrix is decomposed into 128 sub-matrices of size 16 × 11, the maximum of each 16 × 11 sub-matrix is computed, 128 maxima are obtained in total, and these 128 results are combined into a 128-dimensional output feature vector; similarly, when global average pooling is applied to a 128 × 16 × 11 feature matrix, it is decomposed into 128 sub-matrices of size 16 × 11, the average of each sub-matrix is computed, and the 128 averages are combined into a 128-dimensional output feature vector.
The final output of one horizontal pyramid pooling structure is 31 128-dimensional vectors.
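A sketch of the horizontal pyramid pooling just described: the 128 × 16 × 11 matrix is split along its height into 1, 2, 4, 8 and 16 horizontal strips, and each strip is reduced to a 128-dimensional vector. Summing the global max and global average poolings is an assumption; the text says both are used but not how they are combined.

```python
import torch

def horizontal_pyramid_pooling(feat, scales=(1, 2, 4, 8, 16)):
    """feat: (128, 16, 11). Returns a (31, 128) tensor of strip features."""
    c, h, w = feat.shape
    strips = []
    for n in scales:
        # split the height into n horizontal strips of shape (c, h // n, w)
        for part in feat.split(h // n, dim=1):
            gmp = part.amax(dim=(1, 2))         # global max pooling  -> (128,)
            gap = part.mean(dim=(1, 2))         # global average pooling
            strips.append(gmp + gap)            # assumed combination
    return torch.stack(strips)                  # 1+2+4+8+16 = 31 vectors

print(horizontal_pyramid_pooling(torch.randn(128, 16, 11)).shape)  # (31, 128)
```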
The fully connected network comprises a first fully connected sub-network and a second fully connected sub-network; the first fully-connected sub-network and the second fully-connected sub-network respectively comprise 31 independent fully-connected neural network layers.
The fully-connected network contains a total of 62 fully-connected layers; each layer takes a 128-dimensional input and outputs a 256-dimensional feature.
The outputs of the first fully-connected sub-network and the second fully-connected sub-network are taken as the outputs of the teacher model Mt.
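A sketch of one fully-connected sub-network: 31 independent 128 → 256 layers, one per pyramid strip, applied in parallel through a single batched matrix multiply. The einsum formulation and initialization are implementation choices, not from the text.

```python
import torch
import torch.nn as nn

class SeparateFC(nn.Module):
    """31 independent fully-connected layers, one per HPP strip (128 -> 256)."""
    def __init__(self, num_strips=31, in_dim=128, out_dim=256):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_strips, in_dim, out_dim) * 0.01)

    def forward(self, x):
        # x: (B, 31, 128) strip features -> (B, 31, 256)
        return torch.einsum('bsi,sio->bso', x, self.weight)

fc = SeparateFC()
print(fc(torch.randn(4, 31, 128)).shape)    # torch.Size([4, 31, 256])
```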
The output of the first convolution module is connected to the input of the second convolution module. The output of the first convolution module is also connected to the input of the fourth convolution module through a first aggregate pooling structure.
The output of the second convolution module is connected to the input of the third convolution module. And the output of the second convolution module is connected with the input of the second set pooling structure, and the output of the second set pooling structure is added with the output of the fourth convolution module at the corresponding position and then connected with the input of the fifth convolution module.
And the output of the third convolution module is connected with the input of the third set pooling structure, and the output of the third set pooling structure is added with the output of the fifth convolution module at the corresponding position and then connected with the input of the first horizontal pyramid pooling structure.
The output of the third convolution module is also connected to the input of the second horizontal pyramid pooling structure via a fourth set pooling structure.
The output of the first horizontal pyramid pooling structure is connected to the input of the first fully-connected sub-network; the output of the second horizontal pyramid pooling structure is connected to the input of a second fully-connected sub-network.
The student model Ms consists of a simplified convolutional network, a set pooling structure, a simplified horizontal pyramid pooling structure and a simplified fully-connected network.
The reduced convolutional network only comprises one backbone convolutional network, and specifically, as shown in fig. 1, the reduced convolutional network comprises a sixth convolutional module, a seventh convolutional module, and an eighth convolutional module.
Compared with the first, second and third convolution modules of the backbone network in the teacher model, the sixth, seventh and eighth convolution modules are simplified respectively.
The sixth convolution module removes the 3 × 3 standard convolutional layer of the first convolution module.
The seventh convolution module replaces the two standard convolutional layers of the second convolution module with a 5 × 5 depthwise convolutional layer and a 1 × 1 pointwise convolutional layer, respectively.
Similarly, the eighth convolution module replaces the two standard convolutional layers of the third convolution module with a 3 × 3 depthwise convolutional layer and a 1 × 1 pointwise convolutional layer, respectively.
Specifically, the sixth convolution module is composed of two layers, wherein:
the first layer is a standard convolutional layer using a 5 × 5 convolution kernel; its input is s (number of frames) × 1 × 64 × 44 data and its output an s × 32 × 64 × 44 feature map;
the second layer is a max pooling layer with a 2 × 2 pooling kernel and a stride of 2; its input is s × 32 × 64 × 44 data and its output an s × 32 × 32 × 22 feature map.
The seventh convolution module consists of three layers, wherein:
the first layer is a depthwise convolutional layer using a 5 × 5 convolution kernel; its input is s × 32 × 32 × 22 data and its output an s × 32 × 32 × 22 feature map;
the second layer is a pointwise convolutional layer using a 1 × 1 convolution kernel; its input is s × 32 × 32 × 22 data and its output an s × 64 × 32 × 22 feature map;
the third layer is a max pooling layer with a 2 × 2 pooling kernel and a stride of 2; its input is s × 64 × 32 × 22 data and its output an s × 64 × 16 × 11 feature map.
the eighth convolution module is comprised of two layers, wherein:
the first layer is a depthwise convolutional layer using a 3 × 3 convolution kernel; its input is s × 64 × 16 × 11 data and its output an s × 64 × 16 × 11 feature map;
the second layer is a pointwise convolutional layer using a 1 × 1 convolution kernel; its input is s × 64 × 16 × 11 data and its output an s × 128 × 16 × 11 feature map.
The structure of the depthwise separable convolution is shown in Fig. 6; it is a known structure and is not described in detail here.
The sixth convolution module, the seventh convolution module and the eighth convolution module are connected in sequence.
In a preferred embodiment, a pointwise convolutional layer is additionally placed before the depthwise convolutional layer of the seventh convolution module, and likewise before the depthwise convolutional layer of the eighth convolution module.
This design improves the network performance of the lightweight student model while leaving the model capacity almost unchanged.
The set pooling structure in the student model Ms is defined as the fifth set pooling structure; the output of the eighth convolution module is connected to the input of the simplified horizontal pyramid pooling structure via the fifth set pooling structure.
The fifth set pooling structure likewise consists of a statistical (maximum) function; its input is the s (number of frames) × 128 × 16 × 11 feature matrix and its output a 128 × 16 × 11 feature matrix.
The reduced horizontal pyramid pooling structure is composed of a global maximum pooling and a global average pooling, and the structure is shown in fig. 4.
The input of the simplified horizontal pyramid pooling is a 128 × 16 × 11 feature matrix; the intermediate feature matrices are 16 three-dimensional matrices of 128 × 1 × 11, and global max pooling and global average pooling then yield 16 feature vectors of 128 dimensions.
The output of the simplified horizontal pyramid pooling structure is connected with the input of a simplified fully-connected network, the simplified fully-connected network comprises 16 independent fully-connected neural network layers, the input of each layer is a 128-dimensional vector, and the output is a 128-dimensional vector.
Compared with conventional gait recognition models built from standard convolutional layers, the invention uses low-cost depthwise separable convolution to design a compact, lightweight gait recognition model (the student model), structurally reducing the number of model parameters.
Example 2
This embodiment describes a gait recognition model compression method based on the local-global joint knowledge distillation algorithm, built on the gait recognition model compression system of Embodiment 1.
In this embodiment 2, the purpose of gait recognition is achieved by training the two models in the above embodiment 1.
As shown in fig. 7, the gait recognition model compression method based on the local-global joint knowledge distillation algorithm includes the following steps:
Step 1. Extract the gait silhouette sequences in the gait videos using background subtraction (a conventional method) and uniformly crop them into picture sets of the same size, e.g., sets of 64 × 64 pixel images.
The picture sets form a data set X, which is divided into a training set Xtrain and a test set Xtest.
Taking a deep convolutional neural network as the basic structure, two gait recognition models are constructed, denoted the large-capacity teacher model Mt and the lightweight student model Ms; the model structures are described in Embodiment 1 and not repeated here.
Step 2. Train the teacher model Mt using the training set Xtrain. The learning rate is set to 0.0001 and the number of iterations to 80000; the optimizer is the Adam optimizer, and each iteration inputs 16 sequences of 8 subjects (128 sequences in total, each sequence randomly selecting 30 frames, scaled to a size of 64 × 44 pixels).
The loss function is the triplet loss function Ltri shown in formula (1):

$$L_{tri}=\frac{1}{N_{tri+}}\sum_{t=1}^{n}\sum_{i=1}^{P}\sum_{j=1}^{K}\sum_{\substack{a=1\\ a\neq j}}^{K}\sum_{\substack{p=1\\ p\neq i}}^{P}\sum_{k=1}^{K}\left[m+\left\|x_{i,j}^{t}-x_{i,a}^{t}\right\|_{2}-\left\|x_{i,j}^{t}-x_{p,k}^{t}\right\|_{2}\right]_{+}\qquad(1)$$

where Ntri+ represents the total number of sample pairs in a training sample subset whose Euclidean distance is not 0; a training sample subset is a set of sample images randomly selected from the training set Xtrain at each training iteration;
n represents the number of fully-connected neural network layers of the teacher network, and t indexes those layers;
P represents the number of pedestrians contained in each training sample subset; i and p index the pedestrian samples to be trained in each subset;
K represents the number of video sequences of each pedestrian in each subset; a, j and k index the pedestrian video sequences;
m represents the boundary threshold (margin) of the loss function;
x^t_{i,j} denotes, at the t-th fully-connected layer, the j-th sample to be trained of pedestrian i in the subset; x^t_{i,a} denotes any sample with the same pedestrian identity as x^t_{i,j}; x^t_{p,k} denotes any sample with a different pedestrian identity;
the symbol ||·||2 denotes the 2-norm of a matrix;
[·]+ denotes the ReLU operation, computed as [x]+ = max{0, x}, where max takes the maximum value.
By reducing the loss value, samples of the same subject are drawn closer together while samples of different subjects are pushed apart; the boundary threshold m of the loss function is set to 0.2, and the training target is to make the recognition performance of the teacher model Mt as good as possible.
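A sketch of the teacher training loop under the stated settings (Adam, learning rate 0.0001, 80000 iterations, batches of 8 subjects × 16 sequences × 30 frames). The stand-in model and sampler are assumptions for illustration; triplet_loss refers to the sketch given after formula (1).

```python
import torch
import torch.nn as nn

# Stand-ins for the real teacher network and batch sampler (assumptions).
teacher = nn.Sequential(nn.Flatten(), nn.Linear(30 * 64 * 44, 256))

def sample_batch(p=8, k=16, frames=30):
    x = torch.randn(p * k, frames, 64, 44)       # p subjects x k sequences each
    y = torch.arange(p).repeat_interleave(k)     # identity labels
    return x, y

optimizer = torch.optim.Adam(teacher.parameters(), lr=1e-4)
for it in range(80000):                          # 80000 iterations
    frames, labels = sample_batch()
    feats = teacher(frames).unsqueeze(0)         # (1, 128, 256) mock FC features
    loss = triplet_loss(feats, labels, margin=0.2)   # formula (1), sketched above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```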
Step 3. Input the training set Xtrain into the trained teacher model Mt and the untrained student model Ms. For the same data, obtain the multi-dimensional feature matrix F_c^t output by the convolutional network of the teacher model Mt and the multi-dimensional feature matrix F_c^s output by the simplified convolutional network of the student model Ms, as well as the multi-dimensional feature matrix F_f^t output by the fully-connected network of the teacher model Mt and the multi-dimensional feature matrix F_f^s output by the simplified fully-connected network of the student model Ms.
The dimensions of F_c^t and F_c^s are b × s × c × h × w; the dimensions of F_f^t and F_f^s are b × n × d; where b denotes the number of samples in each training sample subset, s the number of frames, c the number of feature maps output by the convolutional layers, h and w the height and width of the convolutional output feature maps, and d the dimension of the feature matrix output by the fully-connected network.
Step 4. make the difference metric function Lc_disComputing a multidimensional feature matrix
Figure BDA0003173157350000129
And a multi-dimensional feature matrix
Figure BDA00031731573500001210
The difference between them; wherein the difference metric function Lc_disThe calculation formula of (a) is as follows:
Figure BDA00031731573500001211
in the formula, the difference metric function Lc_disRepresenting a loss of partial distillation, the symbol | | | | non-conducting phosphor2 FRepresenting the F-norm of the matrix.
When calculating the similarity matrix differences of the output features, the difference of the feature matrix is further normalized by using an L2 regularization method, so that the learning of the student model from the teacher model can be guided to be more effective.
Under this approach, the teacher model MtAnd student model MsThe number of channels of the convolution output features may not remain consistent, that is, a larger or smaller capacity model may be designed for knowledge distillation.
Step 5, respectively calculating the multidimensional characteristic matrix by using the formula (3)
Figure BDA00031731573500001212
And a multi-dimensional feature matrix
Figure BDA00031731573500001213
The difference between each sample in (1) and the result thereof are respectively
Figure BDA00031731573500001214
And
Figure BDA00031731573500001215
represents;
Figure BDA00031731573500001216
wherein,
Figure BDA00031731573500001217
representing the distance between all samples of the same category in the feature matrix output by the teacher model;
Figure BDA00031731573500001218
representing the distance between all the samples of different types in the feature matrix output by the teacher model;
Figure BDA0003173157350000131
representing the distance between all samples of the same category in a feature matrix output by the student model;
Figure BDA0003173157350000132
and the distance between all the samples in different categories in the feature matrix output by the student model is represented. The same-class samples refer to picture samples with the same pedestrian identity labels, and the different-class samples refer to picture samples with different pedestrian identity labels.
Figure BDA0003173157350000133
When representing the teacher model, the jth training sample with the same category in the ith training sample subsetA sample to be trained;
Figure BDA0003173157350000134
when the training teacher model is represented, the kth sample to be trained with the same category in the pth training sample subset;
Figure BDA0003173157350000135
when the training teacher model is represented, the jth sample to be trained with different categories in the ith training sample subset is represented;
Figure BDA0003173157350000136
when the training teacher model is represented, the kth sample to be trained with different classes in the pth training sample subset;
Figure BDA0003173157350000137
when the training student model is expressed, the jth sample to be trained with the same category in the ith training sample subset is represented;
Figure BDA0003173157350000138
when the student model is trained, the kth sample to be trained with the same category in the pth training sample subset is represented;
Figure BDA0003173157350000139
when the training student model is represented, the ith training sample subset is the jth sample to be trained with different categories;
Figure BDA00031731573500001310
and when the training student model is shown, the kth sample to be trained with different classes in the pth training sample subset.
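A sketch of the intra-/inter-class distance computation of formula (3), using identity labels to mask a pairwise Euclidean distance matrix; shapes and the helper name are assumptions.

```python
import torch

def class_distances(fc_feat, labels):
    """fc_feat: (B, n, d) FC features; labels: (B,) identities.
    Returns (d_intra, d_inter): distances between same- and
    different-identity sample pairs."""
    flat = fc_feat.flatten(1)                    # (B, n*d)
    dist = torch.cdist(flat, flat)               # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)
    d_intra = dist[same & off_diag]              # same pedestrian identity
    d_inter = dist[~same]                        # different identities
    return d_intra, d_inter
```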
Step 6, using the triple loss function L shown in the formula (1)triRespectively calculate
Figure BDA00031731573500001311
And
Figure BDA00031731573500001312
triple loss of
Figure BDA00031731573500001313
And
Figure BDA00031731573500001314
the specific calculation formula is shown in formulas (4) and (5);
Figure BDA00031731573500001315
Figure BDA00031731573500001316
where m represents the boundary threshold of the loss function, set to 0.2;
Figure BDA00031731573500001317
representing the total number of sample pairs consisting of two samples with Euclidean distance not being 0 in a training sample subset in the teacher model;
wherein,
Figure BDA00031731573500001318
representing the total number of sample pairs consisting of two samples with Euclidean distance not being 0 in a training sample subset in the student model;
calculating the triplet loss using the Smooth L1 loss function in equation (6)
Figure BDA00031731573500001319
And
Figure BDA00031731573500001320
total distillation loss L off_dis
Figure BDA00031731573500001321
Step 7. partial distillation loss Lc_disTotal distillation loss Lf_disTriple loss
Figure BDA0003173157350000141
And
Figure BDA0003173157350000142
integrating and calculating to obtain the total loss LtotalThe specific calculation formula is as follows:
Figure BDA0003173157350000143
wherein alpha is a distillation loss weight value.
Step 8, setting a student model MsThe iteration number is 30000 times, the optimizer selects Adam optimizer, the previous 10000 times learning rate is set to 0.005, the later 20000 times learning rate is set to 0.001, the Adam optimizer selects Adam optimizer, and loss value L is reducedtotalTransferring the knowledge of the teacher model to the student model MsIn addition, the recognition performance of the student model is improved.
Step 9, test set XtestThe pedestrian video sequence is input into the student network MsAnd (4) carrying out identification to obtain an identification result.
As can be seen from the above process, the invention performs local and global knowledge distillation on the large-capacity gait recognition model (the teacher model) through a joint knowledge distillation algorithm, guiding the student model to learn more knowledge from the teacher model; it thus reduces model capacity while preserving the original recognition performance as far as possible.
By combining lightweight model compression with joint knowledge distillation, the method effectively preserves the model's gait recognition accuracy while reducing the scale of the model parameters, thereby lowering operating cost, reducing training time and inference time, improving model efficiency, and better suiting practical scenarios with high real-time requirements and large data volumes.
Experiments show that, compared with prior-art deep neural network models, the method reduces the number of model parameters by a factor of 9 and the computation by a factor of 19, while performance on the public dataset CASIA-B drops by only 2.2%; actual inference time is shortened, effectively addressing the model-efficiency problem.
It should be understood, however, that the description herein of specific embodiments is not intended to limit the invention to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

Claims (2)

1. A gait recognition model compression system based on a local-global joint knowledge distillation algorithm, characterized in that it
comprises a teacher model Mt and a student model Ms, wherein:
the teacher model Mt consists of a convolutional network, set pooling structures, horizontal pyramid pooling structures and a fully-connected network;
the convolution network consists of a backbone network and a plurality of layers of global channels;
the backbone network consists of a first convolution module, a second convolution module and a third convolution module;
the first convolution module is comprised of three layers, wherein:
the first layer is a standard convolutional layer using a 5 × 5 convolution kernel; the second layer is a standard convolutional layer using a 3 × 3 convolution kernel; the third layer is a max pooling layer with a 2 × 2 pooling kernel and a stride of 2;
the second convolution module is comprised of three layers, wherein:
the first layer is a standard convolutional layer using a 3 × 3 convolution kernel; the second layer is a standard convolutional layer using a 3 × 3 convolution kernel; the third layer is a max pooling layer with a 2 × 2 pooling kernel and a stride of 2;
the third convolution module consists of two layers, wherein:
the first layer is a standard convolutional layer using a 3 × 3 convolution kernel; the second layer is a standard convolutional layer using a 3 × 3 convolution kernel;
the multilayer global channel consists of a fourth convolution module and a fifth convolution module;
the fourth convolution module consists of three layers, wherein: the first layer is a standard convolutional layer using a 3 × 3 convolution kernel; the second layer is a standard convolutional layer using a 3 × 3 convolution kernel; the third layer is a pooling layer with a 2 × 2 pooling kernel;
the fifth convolution module consists of two layers, wherein:
the first layer is a standard convolutional layer using a 3 × 3 convolution kernel; the second layer is a standard convolutional layer using a 3 × 3 convolution kernel;
four set pooling structures, two horizontal pyramid pooling structures and one fully connected network are provided in the teacher model Mt;
the four set pooling structures are defined as the first, second, third and fourth set pooling structures respectively; the two horizontal pyramid pooling structures are defined as the first horizontal pyramid pooling structure and the second horizontal pyramid pooling structure respectively;
the fully connected network comprises a first fully connected sub-network and a second fully connected sub-network;
the output of the first convolution module is connected with the input of the second convolution module;
the output of the first convolution module is also connected with the input of the fourth convolution module through the first set pooling structure;
the output of the second convolution module is connected with the input of the third convolution module;
the output of the second convolution module is connected to the input of the second set pooling structure, and the output of the second set pooling structure is added element-wise at corresponding positions to the output of the fourth convolution module and then connected to the input of the fifth convolution module;
the output of the third convolution module is connected to the input of the third set pooling structure, and the output of the third set pooling structure is added element-wise at corresponding positions to the output of the fifth convolution module and then connected to the input of the first horizontal pyramid pooling structure;
the output of the third convolution module is also connected to the input of the second horizontal pyramid pooling structure through the fourth set pooling structure;
the output of the first horizontal pyramid pooling structure is connected to the input of the first fully-connected sub-network;
the output of the second horizontal pyramid pooling structure is connected to the input of a second fully-connected sub-network;
the outputs of the first fully-connected sub-network and the second fully-connected sub-network serve as the outputs of the teacher model Mt;
the first horizontal pyramid pooling structure and the second horizontal pyramid pooling structure each have five scales;
the first fully-connected sub-network and the second fully-connected sub-network each comprise 31 independent fully-connected neural network layers;
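For orientation, the following is a minimal PyTorch sketch of the teacher backbone just described; it is an illustration only, not the patented implementation, and the channel widths (32, 64, 128) and LeakyReLU activations are assumptions the claim does not specify:

```python
import torch
import torch.nn as nn

def conv(in_c, out_c, k):
    # "standard convolution layer": plain Conv2d followed by an activation (assumed)
    return nn.Sequential(nn.Conv2d(in_c, out_c, k, padding=k // 2),
                         nn.LeakyReLU(inplace=True))

class TeacherBackbone(nn.Module):
    """Backbone of Mt: three convolution modules; the intermediate outputs feed
    the set pooling structures and the multi-layer global pipeline."""
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(conv(1, 32, 5), conv(32, 32, 3),
                                    nn.MaxPool2d(2, 2))                  # first module
        self.block2 = nn.Sequential(conv(32, 64, 3), conv(64, 64, 3),
                                    nn.MaxPool2d(2, 2))                  # second module
        self.block3 = nn.Sequential(conv(64, 128, 3), conv(128, 128, 3)) # third module

    def forward(self, x):          # x: (batch*frames, 1, H, W) silhouettes
        f1 = self.block1(x)        # feeds the first set pooling / fourth module branch
        f2 = self.block2(f1)       # feeds the second set pooling branch
        f3 = self.block3(f2)       # feeds the third and fourth set pooling branches
        return f1, f2, f3
```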
the student model Ms is composed of a simplified convolutional network, a set pooling structure, a simplified horizontal pyramid pooling structure and a simplified fully connected network, wherein:
the simplified convolution network consists of a sixth convolution module, a seventh convolution module and an eighth convolution module;
the sixth convolution module consists of two layers, wherein: the first layer is a standard convolution layer using a 5 × 5 convolution kernel; the second layer is a max-pooling layer with a 2 × 2 pooling kernel and a stride of 2;
the seventh convolution module consists of three layers, wherein:
the first layer is a depthwise convolution layer using a 5 × 5 convolution kernel; the second layer is a pointwise convolution layer using a 1 × 1 convolution kernel; the third layer is a max-pooling layer with a 2 × 2 pooling kernel and a stride of 2;
the eighth convolution module consists of two layers, wherein:
the first layer is a depthwise convolution layer using a 3 × 3 convolution kernel; the second layer is a pointwise convolution layer using a 1 × 1 convolution kernel;
the sixth convolution module, the seventh convolution module and the eighth convolution module are connected in sequence;
the set pooling structure in the student model Ms is defined as the fifth set pooling structure;
the output of the eighth convolution module is connected to the input of the simplified horizontal pyramid pooling structure through the fifth set pooling structure;
the output of the simplified horizontal pyramid pooling structure is connected to the input of the simplified fully-connected network;
the simplified horizontal pyramid pooling structure has only one scale; the simplified fully-connected network comprises 16 independent fully-connected neural network layers; the output of the simplified fully-connected network serves as the output of the student model Ms.
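A corresponding sketch of the student model Ms, again illustrative only; the channel widths, the max-over-frames set pooling and the 256-dimensional feature size are assumptions. With 64 × 44 input silhouettes the feature-map height after the two pooling layers is 16, which matches the 16 horizontal strips of the single-scale pyramid:

```python
import torch
import torch.nn as nn

def ds_conv(in_c, out_c, k):
    # depthwise separable convolution: depthwise k x k conv + 1 x 1 pointwise conv
    return nn.Sequential(
        nn.Conv2d(in_c, in_c, k, padding=k // 2, groups=in_c),  # depthwise layer
        nn.Conv2d(in_c, out_c, 1),                              # pointwise layer
        nn.LeakyReLU(inplace=True))

class StudentNet(nn.Module):
    def __init__(self, parts=16, feat_dim=256):
        super().__init__()
        self.block6 = nn.Sequential(nn.Conv2d(1, 32, 5, padding=2),
                                    nn.LeakyReLU(inplace=True),
                                    nn.MaxPool2d(2, 2))          # sixth module
        self.block7 = nn.Sequential(ds_conv(32, 64, 5),
                                    nn.MaxPool2d(2, 2))          # seventh module
        self.block8 = ds_conv(64, 128, 3)                        # eighth module
        # 16 independent fully connected layers, one per horizontal strip
        self.fc = nn.Parameter(torch.randn(parts, 128, feat_dim))
        self.parts = parts

    def forward(self, x):                    # x: (B, S, 1, H, W) silhouette sequence
        b, s = x.shape[:2]
        f = self.block8(self.block7(self.block6(x.flatten(0, 1))))
        f = f.view(b, s, *f.shape[1:]).max(1).values             # set pooling over frames
        B, C, H, W = f.shape
        strips = f.view(B, C, self.parts, (H // self.parts) * W)
        z = strips.mean(-1) + strips.max(-1).values              # single-scale HPP
        return torch.matmul(z.permute(2, 0, 1), self.fc).permute(1, 0, 2)  # (B, 16, feat_dim)
```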
2. A gait recognition model compression method based on a local-global joint knowledge distillation algorithm, applied to the gait recognition model compression system based on the local-global joint knowledge distillation algorithm of claim 1; characterized in that
the gait recognition model compression method comprises the following steps:
step 1, extracting a gait contour sequence in a gait video by using a background subtraction method, uniformly cutting the gait contour sequence into picture sets with the same size to form a data set X, and dividing the data set X into a training set XtrainAnd test set Xtest
Step 2: train the teacher model Mt with the training set Xtrain, setting the learning rate and the number of iterations; the Adam optimizer is adopted as the optimizer, and the triplet loss function Ltri shown in formula (1) is adopted as the loss function:

$$L_{tri}=\frac{1}{N_{tri+}}\sum_{t=1}^{n}\sum_{\substack{i,p\\ i\neq p}}\sum_{j,k}\left[m+\left\|x_{i,j}^{t}-x_{i,k}^{t}\right\|_{2}-\left\|x_{i,j}^{t}-x_{p,k}^{t}\right\|_{2}\right]_{+}\tag{1}$$
where Ntri+ denotes the total number of sample pairs in a training sample subset consisting of two samples whose Euclidean distance is not 0; a training sample subset is a set of sample images randomly selected from the training set Xtrain at each training pass;
n denotes the number of fully connected neural network layers of the teacher network, and t indexes those layers;
each training sample subset contains a fixed number of pedestrians, each contributing a fixed number of video sequences; i and p index the pedestrians to be trained, and j and k index the pedestrian video sequences, within each training sample subset;
m denotes the margin threshold of the loss function;
$x_{i,j}^{t}$ denotes the j-th sample to be trained of the i-th pedestrian in the training sample subset, taken at the t-th fully connected layer;
$x_{i,k}^{t}$ denotes any sample with the same pedestrian identity as $x_{i,j}^{t}$;
$x_{p,k}^{t}$ denotes any sample of the p-th pedestrian, whose identity differs from that of $x_{i,j}^{t}$;
the symbol $\|\cdot\|_{2}$ denotes the 2-norm of a matrix;
$[\cdot]_{+}$ denotes the ReLU operation, computed as $[x]_{+}=\max\{0,x\}$, where max is the maximum-value operation.
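A compact PyTorch sketch of the batch-all triplet loss of formula (1), for illustration only; feat is assumed to stack one embedding per fully connected layer, and the normalization follows the Ntri+ definition above:

```python
import torch

def batch_all_triplet_loss(feat, labels, margin=0.2):
    # feat: (n, B, d) embeddings of the n fully connected layers; labels: (B,)
    dist = torch.cdist(feat, feat)                        # (n, B, B) Euclidean distances
    same = labels[:, None] == labels[None, :]             # (B, B) same-identity mask
    not_self = ~torch.eye(len(labels), dtype=torch.bool, device=feat.device)
    valid = (same & not_self)[:, :, None] & ~same[:, None, :]  # (B, B, B) anchor/pos/neg
    tri = margin + dist[:, :, :, None] - dist[:, :, None, :]   # m + d(a,p) - d(a,n)
    tri = torch.relu(tri) * valid                         # [.]_+ over valid triplets only
    n_pos = (tri > 0).sum().clamp(min=1)                  # Ntri+: terms with non-zero loss
    return tri.sum() / n_pos
```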
Step 3: input the training set Xtrain into the trained teacher model Mt and the untrained student model Ms respectively, obtaining for the same data: the multi-dimensional feature matrix output by the convolutional network of the teacher model Mt, denoted $F_{c}^{t}$; the multi-dimensional feature matrix output by the simplified convolutional network of the student model Ms, denoted $F_{c}^{s}$; the multi-dimensional feature matrix output by the fully connected network of the teacher model Mt, denoted $F_{f}^{t}$; and the multi-dimensional feature matrix output by the simplified fully connected network of the student model Ms, denoted $F_{f}^{s}$;
the feature matrices $F_{c}^{t}$ and $F_{c}^{s}$ have dimension b × s × c × h × w, and the feature matrices $F_{f}^{t}$ and $F_{f}^{s}$ have dimension b × n × d, where b denotes the number of samples in each training sample subset, s denotes the number of frames, c denotes the number of feature maps output by the convolution layers, h and w denote the height and width of the feature maps output by the convolutional network, and d denotes the dimension of the feature vectors output by the fully connected network;
step 4. make the difference metric function Lc_disComputing a multidimensional feature matrix
Figure FDA00031731573400000313
And a multi-dimensional feature matrix
Figure FDA00031731573400000314
The difference between them; wherein the difference metric function Lc_disThe calculation formula of (a) is as follows:
Figure FDA00031731573400000315
in the formula, the difference metric function Lc_disRepresenting a loss of partial distillation, symbol
Figure FDA00031731573400000316
Represents the F-norm of the matrix;
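An illustrative PyTorch form of formula (2); averaging over the batch dimension is an assumption, and if the teacher and student channel counts differ, a 1 × 1 adapter convolution (not shown) would be needed first:

```python
import torch

def local_distillation_loss(feat_t, feat_s):
    # feat_t, feat_s: (b, s, c, h, w) convolutional feature matrices of Mt and Ms
    diff = feat_t - feat_s
    return diff.pow(2).sum() / feat_t.shape[0]   # squared F-norm, averaged over samples
```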
Step 5: use formula (3) to compute the distances between the samples in the feature matrix $F_{f}^{t}$ and in the feature matrix $F_{f}^{s}$ respectively, denoting the results $d_{+}^{t}$, $d_{-}^{t}$, $d_{+}^{s}$ and $d_{-}^{s}$:

$$d_{+}^{t}=\left\|f_{i,j}^{t}-f_{i,k}^{t}\right\|_{2},\quad d_{-}^{t}=\left\|f_{i,j}^{t}-f_{p,k}^{t}\right\|_{2},\quad d_{+}^{s}=\left\|f_{i,j}^{s}-f_{i,k}^{s}\right\|_{2},\quad d_{-}^{s}=\left\|f_{i,j}^{s}-f_{p,k}^{s}\right\|_{2}\tag{3}$$

where $d_{+}^{t}$ denotes the distances between all same-class samples in the feature matrix output by the teacher model, and $d_{-}^{t}$ denotes the distances between all different-class samples in the feature matrix output by the teacher model;
$d_{+}^{s}$ denotes the distances between all same-class samples in the feature matrix output by the student model, and $d_{-}^{s}$ denotes the distances between all different-class samples in the feature matrix output by the student model;
when training the teacher model, $f_{i,j}^{t}$ and $f_{i,k}^{t}$ denote the j-th and k-th samples to be trained of the same class in the training sample subset, while $f_{i,j}^{t}$ and $f_{p,k}^{t}$ denote samples to be trained of different classes;
when training the student model, $f_{i,j}^{s}$ and $f_{i,k}^{s}$ denote the j-th and k-th samples to be trained of the same class in the training sample subset, while $f_{i,j}^{s}$ and $f_{p,k}^{s}$ denote samples to be trained of different classes;
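The same-class and different-class distance sets of formula (3) could be gathered from the fully connected outputs as in the following sketch; the (B, n, d) tensor layout follows step 3, and everything else is an assumption for illustration:

```python
import torch

def class_distances(feat, labels):
    # feat: (B, n, d) fully connected outputs; labels: (B,) pedestrian identities
    dist = torch.cdist(feat.transpose(0, 1), feat.transpose(0, 1))  # (n, B, B)
    same = labels[:, None] == labels[None, :]
    not_self = ~torch.eye(len(labels), dtype=torch.bool, device=feat.device)
    d_pos = dist[:, same & not_self]   # same-class pairs, excluding each sample with itself
    d_neg = dist[:, ~same]             # distances between all different-class sample pairs
    return d_pos, d_neg
```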
Step 6: use the triplet loss function Ltri shown in formula (1) to compute the triplet losses $L_{tri}^{t}$ and $L_{tri}^{s}$ over the teacher and student distances respectively; the specific calculation formulas are shown in formulas (4) and (5):

$$L_{tri}^{t}=\frac{1}{N_{tri+}^{t}}\sum\left[m+d_{+}^{t}-d_{-}^{t}\right]_{+}\tag{4}$$

$$L_{tri}^{s}=\frac{1}{N_{tri+}^{s}}\sum\left[m+d_{+}^{s}-d_{-}^{s}\right]_{+}\tag{5}$$

where $N_{tri+}^{t}$ denotes the total number of sample pairs in a training sample subset of the teacher model consisting of two samples whose Euclidean distance is not 0, and $N_{tri+}^{s}$ denotes the corresponding total number for the student model;
the global distillation loss Lf_dis between the triplet losses $L_{tri}^{t}$ and $L_{tri}^{s}$ is then calculated using the Smooth L1 loss function in formula (6):

$$L_{f\_dis}=\begin{cases}0.5\left(L_{tri}^{t}-L_{tri}^{s}\right)^{2}, & \left|L_{tri}^{t}-L_{tri}^{s}\right|<1\\ \left|L_{tri}^{t}-L_{tri}^{s}\right|-0.5, & \text{otherwise}\end{cases}\tag{6}$$
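Formula (6) coincides with the standard Smooth L1 loss, so PyTorch's built-in can express the global distillation term; detaching the teacher side is an assumption:

```python
import torch.nn.functional as F

def global_distillation_loss(tri_t, tri_s):
    # tri_t, tri_s: scalar triplet losses from formulas (4) and (5)
    return F.smooth_l1_loss(tri_s, tri_t.detach())
```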
Step 7: integrate the local distillation loss Lc_dis, the global distillation loss Lf_dis and the triplet losses $L_{tri}^{t}$ and $L_{tri}^{s}$ (the latter two entering through Lf_dis) to obtain the total loss Ltotal; the specific calculation formula is as follows:

$$L_{total}=L_{tri}^{s}+\alpha\left(L_{c\_dis}+L_{f\_dis}\right)\tag{7}$$

where α is the distillation loss weight;
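The total loss of formula (7), as reconstructed above, then combines the terms as in this sketch; the default alpha = 0.1 is assumed, not a value from the claim:

```python
def total_loss(tri_s, l_c_dis, l_f_dis, alpha=0.1):
    # student triplet loss plus the weighted local and global distillation losses
    return tri_s + alpha * (l_c_dis + l_f_dis)
```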
Step 8: set the number of iterations of the student model Ms, select the Adam optimizer, and transfer the knowledge of the teacher model to the student model by reducing the loss value Ltotal;
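Step 8 can be pictured as the following minimal distillation loop, reusing the loss sketches above; the tuple interface of the teacher and student networks (convolutional features plus fully connected features) is an assumption:

```python
import torch

def distill(teacher, student, loader, epochs, lr=1e-4, alpha=0.1):
    teacher.eval()                                      # teacher Mt is frozen
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for seqs, labels in loader:                     # seqs: (B, S, 1, H, W)
            with torch.no_grad():
                ft_c, ft_f = teacher(seqs)              # teacher conv / FC features
            fs_c, fs_f = student(seqs)                  # student conv / FC features
            tri_s = batch_all_triplet_loss(fs_f.transpose(0, 1), labels)
            tri_t = batch_all_triplet_loss(ft_f.transpose(0, 1), labels)
            loss = total_loss(tri_s,
                              local_distillation_loss(ft_c, fs_c),
                              global_distillation_loss(tri_t, tri_s),
                              alpha)
            opt.zero_grad()
            loss.backward()
            opt.step()
```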
Step 9: input the pedestrian video sequences of the test set Xtest into the student network Ms for identification, obtaining the recognition results.
CN202110824459.3A 2021-07-21 2021-07-21 Gait recognition model compression system and method based on local-integral combined knowledge distillation algorithm Active CN113505719B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110824459.3A CN113505719B (en) 2021-07-21 2021-07-21 Gait recognition model compression system and method based on local-integral combined knowledge distillation algorithm

Publications (2)

Publication Number Publication Date
CN113505719A true CN113505719A (en) 2021-10-15
CN113505719B CN113505719B (en) 2023-11-24

Family

ID=78014088

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596039A (en) * 2018-03-29 2018-09-28 南京邮电大学 A kind of bimodal emotion recognition method and system based on 3D convolutional neural networks
CN108764462A (en) * 2018-05-29 2018-11-06 成都视观天下科技有限公司 A kind of convolutional neural networks optimization method of knowledge based distillation
CN109034219A (en) * 2018-07-12 2018-12-18 上海商汤智能科技有限公司 Multi-tag class prediction method and device, electronic equipment and the storage medium of image
CN110097084A (en) * 2019-04-03 2019-08-06 浙江大学 Pass through the knowledge fusion method of projection feature training multitask student network
CN112560631A (en) * 2020-12-09 2021-03-26 昆明理工大学 Knowledge distillation-based pedestrian re-identification method
CN112784964A (en) * 2021-01-27 2021-05-11 西安电子科技大学 Image classification method based on bridging knowledge distillation convolution neural network

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116246349A (en) * 2023-05-06 2023-06-09 山东科技大学 Single-source domain generalization gait recognition method based on progressive subdomain mining
CN116246349B (en) * 2023-05-06 2023-08-15 山东科技大学 Single-source domain generalization gait recognition method based on progressive subdomain mining
CN116824640A (en) * 2023-08-28 2023-09-29 江南大学 Leg identification method, system, medium and equipment based on MT and three-dimensional residual error network
CN116824640B (en) * 2023-08-28 2023-12-01 江南大学 Leg identification method, system, medium and equipment based on MT and three-dimensional residual error network
CN117237984A (en) * 2023-08-31 2023-12-15 江南大学 MT leg identification method, system, medium and equipment based on label consistency
CN117237984B (en) * 2023-08-31 2024-06-21 江南大学 MT leg identification method, system, medium and equipment based on label consistency

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant