CN116246075A - Video semantic segmentation method combining dynamic information and static information - Google Patents
Video semantic segmentation method combining dynamic information and static information
- Publication number
- CN116246075A CN202310536770.7A
- Authority
- CN
- China
- Prior art keywords
- time sequence
- feature
- characteristic
- layer
- multiplied
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4038—Image mosaicing, e.g. composing plane images from plane sub-images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/13—Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2200/00—Indexing scheme for image data processing or generation, in general
- G06T2200/32—Indexing scheme for image data processing or generation, in general involving image mosaicing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a video semantic segmentation method combining dynamic information and static information, which comprises the following steps: firstly, a video semantic segmentation network fusing dynamic and static information is constructed; then a loss function is designed and a video semantic segmentation model is obtained by training on a video semantic segmentation data set; finally, intelligent segmentation of videos is realized with the trained model. By improving the video semantic segmentation model and the loss function, the invention improves the mean intersection-over-union of video segmentation, provides a high-precision video semantic segmentation network construction strategy, offers a reference for automating video segmentation, and greatly saves labor cost.
Description
Technical Field
The invention relates to the field of video semantic segmentation, and in particular relates to a video semantic segmentation method combining dynamic information and static information.
Background
With the rapid growth in the number of videos, analyzing and understanding video content has become increasingly important. Video semantic segmentation is an important step in content understanding, and improving its segmentation accuracy is an urgent problem to be solved.
The Chinese patent with publication number CN113139502A discloses a video semantic segmentation method, apparatus, electronic device and storage medium, and proposes improving segmentation accuracy through multi-modal picture information. This may be adequate for simple classification, but when extended to multi-class segmentation, relying on multi-modal images alone leaves the segmentation accuracy far from sufficient.
Disclosure of Invention
Aiming at the defects or improvement demands of the prior art, the invention provides a video semantic segmentation method combining dynamic information and static information, which aims to realize effective segmentation of video and improve the accuracy of video semantic segmentation.
In order to achieve the above object, according to one aspect of the present invention, there is provided a video semantic segmentation method combining dynamic information with static information, comprising the steps of:
step 1, constructing a video semantic segmentation network architecture combining dynamic information and static information;
the video semantic segmentation network architecture is provided with 3 reference frames which are respectively used for processing a video frame at the current moment T, a video frame at the moment T-1 and a video frame at the moment T-2; each reference frame uses a time sequence feature encoder to extract features and outputs a feature map of the corresponding reference frame through a convolution layer; splicing the output characteristic diagram of the second reference system with the output characteristic diagram of the third reference system, sending the output characteristic diagram of the second reference system and the output characteristic diagram of the third reference system to a position learning module to learn position information to obtain a dynamic information characteristic diagram, adding the dynamic information characteristic diagram and the static information characteristic diagram obtained after the output characteristic diagram of the first reference system is learned by the position learning module to obtain a characteristic representation with dynamic information and static information, sending the characteristic representation with the dynamic information and the static information to the position learning module to learn, sending the characteristic representation with the dynamic information and the static information to a decoder to perform characteristic decoding, and finally obtaining the subscript of the maximum value of the category prediction of each corresponding pixel point to obtain a final prediction mask;
step 2, designing a loss function, and training on a data set to obtain a video semantic segmentation model;
and 3, using a video semantic segmentation model to realize intelligent segmentation of the video.
Further, the time sequence feature encoder is divided into four time sequence feature encoding layers, of which the first two layers are composed of time sequence feature residual blocks and the last two layers are composed of time sequence feature random discarding residual blocks;
the first layer and the second layer of time sequence feature coding layers are respectively composed of K1 time sequence feature residual blocks and K2 time sequence feature residual blocks, and the third layer and the fourth layer of time sequence feature coding layers are respectively composed of K3 time sequence feature random discarding residual blocks and K4 time sequence feature random discarding residual blocks;
the time sequence characteristic residual block consists of a convolution layer, a layer normalization layer, a depth convolution layer, an activation layer and a convolution layer, wherein the characteristic diagram of the input time sequence characteristic residual block sequentially passes through the layers, and then the characteristic diagram adding operation is carried out on the characteristic diagram of the input time sequence characteristic residual block and the characteristic diagram of the residual branch to output the characteristic diagram; the time sequence characteristic random discarding residual block consists of a convolution layer, a layer normalization layer, an activation layer, a convolution layer and a random discarding layer, wherein the characteristic diagram of the input time sequence characteristic random discarding residual block sequentially passes through the first four layers, then the characteristic diagram adding operation is carried out through residual branches and the characteristic diagram of the input time sequence characteristic random discarding residual block, and then the characteristic diagram is output after passing through a random discarding layer;
furthermore, the RELU activation function is used by the activation layer, and the Drop path operation is used by the random discarding layer.
Further, the first 5×5 convolution layer of the first time sequence feature residual block in each of the first two time sequence feature encoding layers of the time sequence feature encoder uses a stride of 2 to reduce the height and width of the feature map; in this case a 2×2 convolution layer is used on the residual branch of that block to reduce the height and width of the feature map so that the feature map sizes remain consistent for the addition, and the other time sequence feature residual blocks do not perform this operation. Likewise, the first 7×7 convolution layer of the first time sequence feature random discarding residual block in each of the last two time sequence feature encoding layers uses a stride of 2 to reduce the height and width of the feature map; a 2×2 convolution layer is used on the residual branch of that block to keep the feature map sizes consistent for the addition, and the other time sequence feature random discarding residual blocks do not perform this operation.
Further, the specific processing procedure of the position learning module is as follows:
after the feature map is input to the position learning module, it is split into three branches, each of which performs a feature map reshaping operation that merges the last two dimensions, turning a feature map of size C×H×W into C×(H×W); the first branch then applies a dimension transposition, swapping the first and second dimensions to obtain an (H×W)×C matrix, which is matrix-multiplied with the matrix on the third branch; across the two matrix multiplications, an (H×W)×(H×W) matrix is obtained first and then a C×(H×W) matrix, which is reshaped into a C×H×W tensor; finally, a 1×1 convolution is applied to obtain a 1×H×W map, which is added to the corresponding positions of the feature map that entered the position learning module to give the final output.
Further, the feature map output by the time sequence feature encoder of the first reference frame is passed through a 5×5 convolution to extract features, producing the feature map of the first reference frame; the feature map output by the time sequence feature encoder of the second reference frame is passed through a 7×7 convolution, producing the feature map of the second reference frame; and the feature map output by the time sequence feature encoder of the third reference frame is passed through an 11×11 convolution, producing the feature map of the third reference frame.
Further, the loss function designed in step 2 is a position-weighted loss function L_p composed of two losses, L_1 and L_2, whose specific formulas are as follows:
In the formulas for L_1 and L_2, C is the number of pixel classes, N is the number of pixels in the mask, y_ij denotes the ground-truth label of the i-th pixel for the j-th class, p_ij denotes the predicted probability of the j-th class for the i-th pixel, α_j sets different weights for different classes j, w_i is a position weight that assigns different weights to pixels at different positions, and ε is a small value used to avoid a zero denominator; L_1 and L_2 compose the position-weighted loss function L_p as follows:
where λ is a weight used to control the contribution of the latter part of the loss, and |1-L_2| denotes taking the absolute value of 1-L_2.
Further, the value of α_j is determined by the object to be segmented, and objects that are easy to segment are assigned smaller weights than other objects; the value of w_i is determined by the position of the pixel in the image, and pixels in the middle of the image are given larger position weights than pixels at the image edge.
In general, compared with the prior art, the above technical solutions conceived by the present invention achieve the following beneficial effects:
(1) By improving the network structure, deepening the depth of the network and adding a random discarding layer in the deep layer of the network, the network overfitting can be prevented, and the learning ability and generalization of the network can be improved.
(2) A loss function is designed to simultaneously focus on the prediction of pixel level and the prediction of object edge information.
(3) A position learning module is designed, the correlation of the positions in the feature map is learned through matrix multiplication and convolution, and position weights are given to the feature map, so that the sensitivity of the network to dynamic information and static information is improved, and the segmentation accuracy is improved.
Drawings
Fig. 1 is a flowchart of a technical scheme of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the time sequence feature residual block structure of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a sequential feature random discarding residual block structure of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a position learning module of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a network framework of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Referring to fig. 1, which is a flowchart of the technical scheme of the video semantic segmentation method combining dynamic information and static information provided by an embodiment, the method specifically includes the following steps:
(1) Constructing a video semantic segmentation network architecture combining dynamic information and static information;
specifically, referring to fig. 5, fig. 5 is a schematic diagram of a network frame of a video semantic segmentation method based on dynamic information and static information combination according to an embodiment of the present invention.
First, the network is provided with 3 reference frames, which process the video frame at the current time T, the video frame at time T-1 and the video frame at time T-2, respectively. In special cases: if the video frame at the current time T is the first frame, the reference frames for times T-1 and T-2 both use the video frame at time T; if it is the second frame, the reference frames for times T-1 and T-2 both use the video frame at time T-1.
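As an illustration of this boundary handling, the following sketch (one possible implementation assumed for clarity, not code from the patent) selects the three reference frames for a frame index t:

```python
def select_reference_frames(frames, t):
    """Pick the frames fed to the three reference branches for time t.

    frames: a sequence of video frames ordered by time.
    For t == 0 all three branches see frame t; for t == 1 the T-1 and
    T-2 branches both see frame t-1, as described in the text.
    """
    cur = frames[t]
    prev1 = frames[t - 1] if t >= 1 else frames[t]
    prev2 = frames[t - 2] if t >= 2 else prev1
    return cur, prev1, prev2
```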
Second, each reference frame of the network uses a time sequence feature encoder to extract features, and the time sequence feature encoders of the 3 reference frames have identical structures. The feature map output by the time sequence feature encoder of the first reference frame is passed through a 5×5 convolution to extract features, producing the feature map of the first reference frame. The feature map output by the time sequence feature encoder of the second reference frame is passed through a 7×7 convolution, producing the feature map of the second reference frame. The feature map output by the time sequence feature encoder of the third reference frame is passed through an 11×11 convolution, producing the feature map of the third reference frame. Convolutions of different scales are used to integrate information from video frames at different times: the farther a frame is from the current time, the larger the convolution kernel, because the objects to be segmented differ more from their appearance at the current time and therefore a larger convolution kernel is needed for their feature representation.
Finally, the output feature map of the second reference frame is concatenated with the output feature map of the third reference frame, and the result is sent to a position learning module to learn position information, yielding a dynamic information feature map (for the position learning module, please refer to fig. 4, which is a schematic structural diagram of the position learning module of the video semantic segmentation method combining dynamic information and static information provided by an embodiment). The dynamic information feature map is added to the static information feature map obtained by passing the output feature map of the first reference frame through the position learning module, giving a feature representation that carries both dynamic and static information. This representation is sent to the position learning module for learning and then to a decoder for feature decoding, and finally the index of the maximum class prediction of each pixel is taken to obtain the final prediction mask.
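A minimal PyTorch sketch of this fusion pipeline is given below. The encoder, position learning module and decoder are passed in as placeholder modules; sharing one encoder and one position learning module across branches, and halving the channel counts of the T-1 and T-2 branches so that their concatenation matches the static branch, are assumptions made only to keep the shapes consistent, since the patent does not state them.

```python
import torch
import torch.nn as nn

class DynamicStaticSegNet(nn.Module):
    """Sketch of the three-branch fusion pipeline; shapes and weight sharing are assumptions."""
    def __init__(self, encoder, pos_module, decoder, channels=256):
        super().__init__()
        self.encoder = encoder                                            # time sequence feature encoder (assumed shared)
        self.conv_t  = nn.Conv2d(channels, channels, 5, padding=2)        # current frame T
        self.conv_t1 = nn.Conv2d(channels, channels // 2, 7, padding=3)   # frame T-1 (half channels, assumed)
        self.conv_t2 = nn.Conv2d(channels, channels // 2, 11, padding=5)  # frame T-2 (half channels, assumed)
        self.pos = pos_module                                             # position learning module
        self.decoder = decoder                                            # two-stage decoder -> per-class score maps

    def forward(self, frame_t, frame_t1, frame_t2):
        f_t  = self.conv_t(self.encoder(frame_t))
        f_t1 = self.conv_t1(self.encoder(frame_t1))
        f_t2 = self.conv_t2(self.encoder(frame_t2))
        dynamic = self.pos(torch.cat([f_t1, f_t2], dim=1))   # dynamic information feature map
        static  = self.pos(f_t)                               # static information feature map
        fused   = self.pos(dynamic + static)                  # combined dynamic + static representation
        scores  = self.decoder(fused)                         # (B, num_classes, H, W)
        return scores.argmax(dim=1)                           # per-pixel class index = prediction mask
```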
The decoder used here is a two-stage feature map decoding structure commonly used in the field of video segmentation.
Specifically, the time sequence feature encoder is composed of two kinds of residual blocks, namely the time sequence feature residual block and the time sequence feature random discarding residual block; please refer to fig. 2 and fig. 3, which are schematic diagrams of the time sequence feature residual block structure and the time sequence feature random discarding residual block structure, respectively, of the video semantic segmentation method combining dynamic information and static information provided by an embodiment. The time sequence feature encoder can be divided into four time sequence feature encoding layers, of which the first two layers are composed of time sequence feature residual blocks and the last two layers are composed of time sequence feature random discarding residual blocks.
The first and second time sequence feature encoding layers are composed of 4 and 6 time sequence feature residual blocks, respectively, and the third and fourth time sequence feature encoding layers are composed of 9 and 15 time sequence feature random discarding residual blocks, respectively; these values are the best parameters determined through experiments.
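This stage configuration can be summarised in a short construction sketch. `block` and `drop_block` stand for the two residual block types detailed further below; the channel widths are placeholders, not values given in the patent.

```python
import torch.nn as nn

# Block counts per encoding layer as stated above: K1=4, K2=6, K3=9, K4=15.
BLOCKS_PER_STAGE = (4, 6, 9, 15)

def build_encoder(block, drop_block, in_ch=3, widths=(64, 128, 256, 512)):
    """Assemble the four time sequence feature encoding layers.

    `block` / `drop_block` are constructors for the two residual block types
    (sketched further below); `widths` are assumed channel counts.
    """
    stages, prev = [], in_ch
    for i, (n, w) in enumerate(zip(BLOCKS_PER_STAGE, widths)):
        make = block if i < 2 else drop_block          # first two layers use plain residual blocks
        blocks = [make(prev, w, downsample=True)]      # first block of each layer downsamples (stride 2)
        blocks += [make(w, w, downsample=False) for _ in range(n - 1)]
        stages.append(nn.Sequential(*blocks))
        prev = w
    return nn.Sequential(*stages)
```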
Specifically, the position learning module operates as follows: after the feature map is input to the position learning module, it is split into three branches, each of which performs a feature map reshaping operation that merges the last two dimensions, turning a feature map of size C×H×W into C×(H×W). The first branch then applies a dimension transposition, swapping the first and second dimensions to obtain an (H×W)×C matrix, which is matrix-multiplied with the matrix on the third branch; across the two matrix multiplications, an (H×W)×(H×W) matrix is obtained first and then a C×(H×W) matrix, which is reshaped into a C×H×W tensor. Finally, a 1×1 convolution is applied to obtain a 1×H×W map, which is added to the corresponding positions of the feature map that entered the position learning module to give the final output.
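A minimal PyTorch rendering of this procedure is sketched below. It follows the matrix shapes stated above; the absence of normalisation on the (H×W)×(H×W) similarity matrix simply mirrors the text, and broadcasting the 1×H×W map over all channels during the final addition is an assumption about how "added to the corresponding position" is meant.

```python
import torch
import torch.nn as nn

class PositionLearningModule(nn.Module):
    """Position learning module sketch following the matrix shapes described above."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, 1, kernel_size=1)   # final 1x1 conv -> 1 x H x W position map

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        f = x.flatten(2)                         # all three branches share this C x (H*W) reshape
        sim = torch.bmm(f.transpose(1, 2), f)    # (H*W) x C  @  C x (H*W)  ->  (H*W) x (H*W)
        out = torch.bmm(f, sim)                  # C x (H*W)  @  (H*W) x (H*W)  ->  C x (H*W)
        out = out.view(b, c, h, w)               # reshape back to C x H x W
        pos = self.proj(out)                     # 1 x H x W position weights
        return x + pos                           # broadcast-add to the input feature map
```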
Specifically, the time sequence feature residual block and the time sequence feature random discarding residual block are constituted as follows. The time sequence feature residual block consists of a 5×5 convolution layer, a layer normalization layer, a 3×3 depthwise convolution layer, an activation layer and a 1×1 convolution layer; the feature map input to the block passes through these layers in sequence and is then added to the feature map of the residual branch, and the sum is output. The time sequence feature random discarding residual block consists of a 7×7 convolution layer, a layer normalization layer, an activation layer, a 1×1 convolution layer and a random discarding layer; the feature map input to the block passes through the first four layers in sequence, is added to the feature map carried by the residual branch, and the result is output after passing through the random discarding layer. The activation layer uses the ReLU activation function, and the random discarding layer uses the DropPath operation.
Specifically, the two residual blocks in the time sequence feature encoder are arranged as follows. The first 5×5 convolution layer of the first time sequence feature residual block in each of the first two time sequence feature encoding layers uses a stride of 2 to reduce the height and width of the feature map; in this case a 2×2 convolution layer is used on the residual branch of that block to reduce the height and width of the feature map so that the feature map sizes remain consistent for the addition, and the other time sequence feature residual blocks do not perform this operation. The reason is that setting the stride of these convolutions to 2 reduces the size of the feature map, while the other blocks only need to learn features and do not need to reduce the feature map size. Likewise, the first 7×7 convolution layer of the first time sequence feature random discarding residual block in each of the last two time sequence feature encoding layers uses a stride of 2 to reduce the height and width of the feature map; a 2×2 convolution layer is used on the residual branch of that block to keep the feature map sizes consistent for the addition, and the other time sequence feature random discarding residual blocks do not perform this operation.
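The two block types can be sketched as follows. Kernel sizes, the layer order and the stride-2 / 2×2 shortcut handling follow the description above; applying DropPath to the residual branch before the addition is an assumption (the text places the random discarding layer after the addition), the drop probability is an assumed value, and the channel-last LayerNorm wrapper is an implementation detail not taken from the patent.

```python
import torch
import torch.nn as nn

def drop_path(x, p, training):
    """DropPath / stochastic depth: randomly zero whole samples of a residual branch."""
    if p == 0.0 or not training:
        return x
    keep = 1.0 - p
    mask = (torch.rand(x.shape[0], 1, 1, 1, device=x.device) < keep).float()
    return x * mask / keep

class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel dimension of a (B, C, H, W) tensor (implementation detail)."""
    def __init__(self, c):
        super().__init__()
        self.norm = nn.LayerNorm(c)
    def forward(self, x):
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

class TimeSeqResBlock(nn.Module):
    """5x5 conv -> LayerNorm -> 3x3 depthwise conv -> ReLU -> 1x1 conv, plus shortcut."""
    def __init__(self, in_ch, out_ch, downsample=False):
        super().__init__()
        stride = 2 if downsample else 1
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 5, stride=stride, padding=2),
            ChannelLayerNorm(out_ch),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch),   # depthwise convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),
        )
        # 2x2 stride-2 conv on the shortcut only in the downsampling block; the other blocks
        # keep their channel count, so an identity shortcut suffices. Even H/W is assumed so
        # that both paths shrink to the same size.
        self.shortcut = nn.Conv2d(in_ch, out_ch, 2, stride=2) if downsample else nn.Identity()
    def forward(self, x):
        return self.body(x) + self.shortcut(x)

class TimeSeqDropResBlock(nn.Module):
    """7x7 conv -> LayerNorm -> ReLU -> 1x1 conv with DropPath, plus shortcut."""
    def __init__(self, in_ch, out_ch, downsample=False, p=0.1):   # p=0.1 is an assumed rate
        super().__init__()
        stride = 2 if downsample else 1
        self.p = p
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 7, stride=stride, padding=3),
            ChannelLayerNorm(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, 2, stride=2) if downsample else nn.Identity()
    def forward(self, x):
        # DropPath applied to the residual branch before the addition (common practice);
        # the patent text describes the random discarding layer after the addition.
        return drop_path(self.body(x), self.p, self.training) + self.shortcut(x)
```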
(2) Designing a loss function and training on an urban landscape dataset to obtain the video semantic segmentation model. The urban landscape dataset has 19 classes; the label of a picture is a single-channel image whose pixel values lie in the range 0 to 18, with each class corresponding to one pixel value, so that classification is realized at the pixel level; such a label is commonly called a mask image.
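For reference, a small sketch (an illustration, not code from the patent) of how such a single-channel mask relates to the per-pixel, per-class targets y_ij used in the loss below:

```python
import torch
import torch.nn.functional as F

num_classes = 19
mask = torch.randint(0, num_classes, (512, 1024))   # single-channel label image, values 0..18
y = F.one_hot(mask, num_classes)                     # (H, W, C): y[i, j, c] = 1 for the true class
print(y.shape, y.sum(dim=-1).unique())               # every pixel belongs to exactly one class
```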
specifically, the loss function designed in the step 2 is a position weighted loss number L p Loss L from two parts 1 And L 2 Composition, L 1 And L 2 The specific formula is as follows:
In the formulas for L_1 and L_2, C is the number of pixel classes, N is the number of pixels in the mask, y_ij denotes the ground-truth label of the i-th pixel for the j-th class, and p_ij denotes the predicted probability of the j-th class for the i-th pixel. α_j sets different weights for different classes j: objects that are easy to segment, such as the background and persons, are assigned lower weights, while the other classes are assigned larger weights according to experimental results, the ratio of the two weights being 9:10. w_i is a position weight that assigns different weights to pixels at different positions; the position weight for the middle of the image is larger than that for the image edge, 1.1 and 1 respectively. ε is a small value, usually set to 0.0004, used to avoid a zero denominator. L_1 and L_2 compose the position-weighted loss function L_p as follows:
where λ is a weight used to control the contribution of the latter part of the loss and is typically set to 0.8. |1-L_2| denotes taking the absolute value of 1-L_2; this part of the loss is expressed through L_2, and by keeping 1-L_2 as small as possible, network training is focused more on the accuracy of boundary pixel segmentation. Combining the two terms into the position-weighted loss function L_p allows training to attend both to the overall segmentation and to the segmentation of edge information.
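The formulas for L_1 and L_2 are not reproduced in this text. For readability, one plausible reconstruction consistent with the symbol definitions above is sketched below: L_1 is written as a class- and position-weighted cross-entropy and L_2 as a Dice-style overlap term; these two forms are assumptions, while the combination of L_1 and |1-L_2| follows directly from the description.

```latex
% L_1 and L_2 below are assumed forms consistent with the symbol definitions above;
% only the combination L_p is stated explicitly in the text.
\begin{align}
L_1 &= -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} \alpha_j\, w_i\, y_{ij}\,\log p_{ij}
  && \text{(class- and position-weighted cross-entropy, assumed)}\\
L_2 &= \frac{2\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\, p_{ij} + \varepsilon}
            {\sum_{i=1}^{N}\sum_{j=1}^{C} \left(y_{ij} + p_{ij}\right) + \varepsilon}
  && \text{(Dice-style overlap term, assumed)}\\
L_p &= L_1 + \lambda\,\lvert 1 - L_2 \rvert
  && \text{(combination as described, } \lambda \approx 0.8\text{)}
\end{align}
```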
(3) And using a video semantic segmentation model to realize intelligent segmentation of the video.
The invention provides a video semantic segmentation method combining dynamic information and static information. By improving the network structure and designing a dedicated loss function, efficient segmentation of videos is realized, the need to perform video segmentation manually is removed, and a high-accuracy video segmentation network construction strategy is provided. Compared with existing advanced video semantic segmentation methods, the mean intersection-over-union on the urban landscape dataset is improved by 0.8%.
Various modifications and alterations to this application may be made by those skilled in the art without departing from the spirit and scope of this application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.
Claims (8)
1. The video semantic segmentation method combining dynamic information and static information is characterized by comprising the following steps of:
step 1, constructing a video semantic segmentation network architecture combining dynamic information and static information;
the video semantic segmentation network architecture is provided with 3 reference frames which are respectively used for processing a video frame at the current moment T, a video frame at the moment T-1 and a video frame at the moment T-2; each reference frame uses a time sequence feature encoder to extract features and outputs a feature map of the corresponding reference frame through a convolution layer; splicing the output characteristic diagram of the second reference system with the output characteristic diagram of the third reference system, sending the obtained result to a position learning module to learn position information to obtain a dynamic information characteristic diagram, adding the dynamic information characteristic diagram with the static information characteristic diagram obtained after the output characteristic diagram of the first reference system is learned by the position learning module to obtain a characteristic representation with dynamic information and static information, sending the characteristic representation with the dynamic information and the static information to the position learning module to learn, sending the obtained characteristic representation with the dynamic information and the static information to a decoder to perform characteristic decoding, and finally obtaining the subscript of the maximum value of the prediction of each corresponding pixel point category to obtain a final prediction mask;
step 2, designing a loss function, and training on a data set to obtain a video semantic segmentation model;
and 3, using a video semantic segmentation model to realize intelligent segmentation of the video.
2. The video semantic segmentation method combining dynamic information and static information according to claim 1, wherein: the time sequence feature encoder is divided into four time sequence feature encoding layers, of which the first two layers are composed of time sequence feature residual blocks and the last two layers are composed of time sequence feature random discarding residual blocks;
the first layer and the second layer of time sequence feature coding layers are respectively composed of K1 time sequence feature residual blocks and K2 time sequence feature residual blocks, and the third layer and the fourth layer of time sequence feature coding layers are respectively composed of K3 time sequence feature random discarding residual blocks and K4 time sequence feature random discarding residual blocks;
the time sequence characteristic residual block consists of a convolution layer, a layer normalization layer, a depth convolution layer, an activation layer and a convolution layer, wherein the characteristic diagram of the input time sequence characteristic residual block sequentially passes through the layers, and then the characteristic diagram adding operation is carried out on the characteristic diagram of the input time sequence characteristic residual block and the characteristic diagram of the residual branch to output the characteristic diagram; the time sequence characteristic random discarding residual block consists of a convolution layer, a layer normalization layer, an activation layer, a convolution layer and a random discarding layer, wherein the characteristic diagram of the input time sequence characteristic random discarding residual block sequentially passes through the first four layers, then the characteristic diagram adding operation is carried out through the residual branch and the characteristic diagram of the input time sequence characteristic random discarding residual block, and then the characteristic diagram is output after passing through a random discarding layer.
3. The video semantic segmentation method combining dynamic information and static information according to claim 2, wherein: the activation layer uses the ReLU activation function, and the random discarding layer uses the DropPath operation.
4. The video semantic segmentation method combining dynamic information and static information according to claim 2, wherein: the first 5×5 convolution layer of the first time sequence feature residual block in each of the first two time sequence feature encoding layers of the time sequence feature encoder uses a stride of 2 to reduce the height and width of the feature map; in this case a 2×2 convolution layer is used on the residual branch of that block to reduce the height and width of the feature map so that the feature map sizes remain consistent for the addition, and the other time sequence feature residual blocks do not perform this operation; likewise, the first 7×7 convolution layer of the first time sequence feature random discarding residual block in each of the last two time sequence feature encoding layers uses a stride of 2 to reduce the height and width of the feature map, a 2×2 convolution layer is used on the residual branch of that block to keep the feature map sizes consistent for the addition, and the other time sequence feature random discarding residual blocks do not perform this operation.
5. The video semantic segmentation method combining dynamic information and static information according to claim 1, wherein the specific processing procedure of the position learning module is as follows:
after the feature map is input to the position learning module, it is split into three branches, each of which performs a feature map reshaping operation that merges the last two dimensions, turning a feature map of size C×H×W into C×(H×W); the first branch then applies a dimension transposition, swapping the first and second dimensions to obtain an (H×W)×C matrix, which is matrix-multiplied with the matrix on the third branch; across the two matrix multiplications, an (H×W)×(H×W) matrix is obtained first and then a C×(H×W) matrix, which is reshaped into a C×H×W tensor; finally, a 1×1 convolution is applied to obtain a 1×H×W map, which is added to the corresponding positions of the feature map that entered the position learning module to give the final output.
6. The video semantic segmentation method combining dynamic information and static information according to claim 1, wherein: the feature map output by the time sequence feature encoder of the first reference frame is passed through a 5×5 convolution to extract features, producing the feature map of the first reference frame; the feature map output by the time sequence feature encoder of the second reference frame is passed through a 7×7 convolution, producing the feature map of the second reference frame; and the feature map output by the time sequence feature encoder of the third reference frame is passed through an 11×11 convolution, producing the feature map of the third reference frame.
7. The video semantic segmentation method combining dynamic information and static information according to claim 1, wherein: the loss function designed in step 2 is a position-weighted loss function L_p composed of two losses, L_1 and L_2, whose specific formulas are as follows:
in the formulas for L_1 and L_2, C is the number of pixel classes, N is the number of pixels in the mask, y_ij denotes the ground-truth label of the i-th pixel for the j-th class, p_ij denotes the predicted probability of the j-th class for the i-th pixel, α_j sets different weights for different classes j, w_i is a position weight that assigns different weights to pixels at different positions, and ε is a small value used to avoid a zero denominator; L_1 and L_2 compose the position-weighted loss function L_p as follows:
where λ is a weight used to control the contribution of the latter part of the loss, and |1-L_2| denotes taking the absolute value of 1-L_2.
8. The video semantic segmentation method combining dynamic information and static information according to claim 7, wherein: the value of α_j is determined by the object to be segmented, and objects that are easy to segment are assigned smaller weights than other objects; the value of w_i is determined by the position of the pixel in the image, and pixels in the middle of the image are given larger position weights than pixels at the image edge.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310536770.7A CN116246075B (en) | 2023-05-12 | 2023-05-12 | Video semantic segmentation method combining dynamic information and static information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310536770.7A CN116246075B (en) | 2023-05-12 | 2023-05-12 | Video semantic segmentation method combining dynamic information and static information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116246075A true CN116246075A (en) | 2023-06-09 |
CN116246075B CN116246075B (en) | 2023-07-21 |
Family
ID=86633542
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310536770.7A Active CN116246075B (en) | 2023-05-12 | 2023-05-12 | Video semantic segmentation method combining dynamic information and static information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116246075B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200092552A1 (en) * | 2018-09-18 | 2020-03-19 | Google Llc | Receptive-Field-Conforming Convolutional Models for Video Coding |
CN111050219A (en) * | 2018-10-12 | 2020-04-21 | 奥多比公司 | Spatio-temporal memory network for locating target objects in video content |
CN111062395A (en) * | 2019-11-27 | 2020-04-24 | 北京理工大学 | Real-time video semantic segmentation method |
CN111652899A (en) * | 2020-05-29 | 2020-09-11 | 中国矿业大学 | Video target segmentation method of space-time component diagram |
CN113570610A (en) * | 2021-07-26 | 2021-10-29 | 北京百度网讯科技有限公司 | Method and device for performing target segmentation on video by adopting semantic segmentation model |
CN114596520A (en) * | 2022-02-09 | 2022-06-07 | 天津大学 | First visual angle video action identification method and device |
CN114663460A (en) * | 2022-02-28 | 2022-06-24 | 华南农业大学 | Video segmentation method and device based on double-current driving encoder and feature memory module |
CN114973071A (en) * | 2022-05-11 | 2022-08-30 | 中国科学院软件研究所 | Unsupervised video target segmentation method and system based on long-term and short-term time sequence characteristics |
US20230035475A1 (en) * | 2021-07-16 | 2023-02-02 | Huawei Technologies Co., Ltd. | Methods and systems for semantic segmentation of a point cloud |
-
2023
- 2023-05-12 CN CN202310536770.7A patent/CN116246075B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200092552A1 (en) * | 2018-09-18 | 2020-03-19 | Google Llc | Receptive-Field-Conforming Convolutional Models for Video Coding |
CN111050219A (en) * | 2018-10-12 | 2020-04-21 | 奥多比公司 | Spatio-temporal memory network for locating target objects in video content |
CN111062395A (en) * | 2019-11-27 | 2020-04-24 | 北京理工大学 | Real-time video semantic segmentation method |
CN111652899A (en) * | 2020-05-29 | 2020-09-11 | 中国矿业大学 | Video target segmentation method of space-time component diagram |
US20230035475A1 (en) * | 2021-07-16 | 2023-02-02 | Huawei Technologies Co., Ltd. | Methods and systems for semantic segmentation of a point cloud |
CN113570610A (en) * | 2021-07-26 | 2021-10-29 | 北京百度网讯科技有限公司 | Method and device for performing target segmentation on video by adopting semantic segmentation model |
CN114596520A (en) * | 2022-02-09 | 2022-06-07 | 天津大学 | First visual angle video action identification method and device |
CN114663460A (en) * | 2022-02-28 | 2022-06-24 | 华南农业大学 | Video segmentation method and device based on double-current driving encoder and feature memory module |
CN114973071A (en) * | 2022-05-11 | 2022-08-30 | 中国科学院软件研究所 | Unsupervised video target segmentation method and system based on long-term and short-term time sequence characteristics |
Non-Patent Citations (5)
Title |
---|
SIYUE YU: "Fast pixel-matching for video object segmentation", Signal Processing: Image Communication, vol. 98, pages 3 - 5 *
YU Feng: "Pedestrian trajectory prediction based on multi-head soft attention graph convolutional network", Journal of Computer Applications, vol. 43, no. 03, pages 736 - 743 *
YU Feng: "Research on virtual try-on algorithms for multi-pose transfer", Journal of Wuhan Textile University, vol. 35, no. 01, pages 3 - 9 *
JING Zhuangwei; GUAN Haiyan; PENG Daifeng; YU Yongtao: "Survey of image semantic segmentation based on deep neural networks", Computer Engineering, vol. 46, no. 10, pages 1 - 17 *
WANG Tingting: "Research on intelligent pedestrian detection methods for road traffic scenes", China Master's Theses Full-text Database (Engineering Science and Technology II), no. 02, pages 034 - 1395 *
Also Published As
Publication number | Publication date |
---|---|
CN116246075B (en) | 2023-07-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113052210B (en) | Rapid low-light target detection method based on convolutional neural network | |
CN106960206B (en) | Character recognition method and character recognition system | |
CN111368636B (en) | Object classification method, device, computer equipment and storage medium | |
CN112801027B (en) | Vehicle target detection method based on event camera | |
CN112634296A (en) | RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism | |
CN111738169A (en) | Handwriting formula recognition method based on end-to-end network model | |
CN112084859A (en) | Building segmentation method based on dense boundary block and attention mechanism | |
CN117197763A (en) | Road crack detection method and system based on cross attention guide feature alignment network | |
CN113901924A (en) | Document table detection method and device | |
CN113436198A (en) | Remote sensing image semantic segmentation method for collaborative image super-resolution reconstruction | |
CN117975418A (en) | Traffic sign detection method based on improved RT-DETR | |
CN111429468B (en) | Cell nucleus segmentation method, device, equipment and storage medium | |
CN116246075B (en) | Video semantic segmentation method combining dynamic information and static information | |
CN118230323A (en) | Semantic segmentation method for fusing space detail context and multi-scale interactive image | |
CN116597339A (en) | Video target segmentation method based on mask guide semi-dense contrast learning | |
CN114782995A (en) | Human interaction behavior detection method based on self-attention mechanism | |
CN115205518A (en) | Target detection method and system based on YOLO v5s network structure | |
Wang et al. | Research on Semantic Segmentation Algorithm for Multiscale Feature Images Based on Improved DeepLab v3+ | |
CN114758279B (en) | Video target detection method based on time domain information transfer | |
CN118155219A (en) | Text detection and recognition method for terminal strip coding tube based on deep learning | |
CN118965058A (en) | Language expression-based arbitrary category counting model and counting method thereof | |
CN118968051A (en) | Urban scene real-time semantic segmentation method based on gating alignment network | |
CN118918326A (en) | Urban scene real-time semantic segmentation method based on bilateral fusion network | |
CN115661676A (en) | Building segmentation system and method based on serial attention module and parallel attention module | |
CN118038042A (en) | Infrared image segmentation method and device based on visible light image enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |