CN116246075A - Video semantic segmentation method combining dynamic information and static information - Google Patents
Video semantic segmentation method combining dynamic information and static information
- Publication number
- CN116246075A CN202310536770.7A
- Authority
- CN
- China
- Prior art keywords
- time sequence
- feature
- characteristic
- layer
- multiplied
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4038—Image mosaicing, e.g. composing plane images from plane sub-images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/13—Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2200/00—Indexing scheme for image data processing or generation, in general
- G06T2200/32—Indexing scheme for image data processing or generation, in general involving image mosaicing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses a video semantic segmentation method combining dynamic information and static information, which comprises the following steps: firstly, a video semantic segmentation network fusing dynamic and static information is constructed; then a loss function is designed and a video semantic segmentation model is obtained by training on a video semantic segmentation data set; finally, intelligent segmentation of videos is realized with the trained model. By improving the video semantic segmentation model and the loss function, the invention improves the mean intersection-over-union of video segmentation, provides a high-precision video semantic segmentation network construction strategy, offers a reference for automating video segmentation, and greatly saves labor cost.
Description
Technical Field
The invention relates to the field of video semantic segmentation, and in particular relates to a video semantic segmentation method combining dynamic information and static information.
Background
With the rapid growth in the number of videos, analyzing and understanding video content has become increasingly important. Video semantic segmentation is an important step in content understanding, and improving its segmentation accuracy is an urgent problem to be solved.
The Chinese patent with publication number CN113139502A discloses a video semantic segmentation method, apparatus, electronic device and storage medium, and proposes improving segmentation accuracy through multi-modal picture information. This may be adequate for simple classification, but when extended to multi-class segmentation, relying on multi-modal images alone leaves the segmentation accuracy far from sufficient.
Disclosure of Invention
Aiming at the defects or improvement demands of the prior art, the invention provides a video semantic segmentation method combining dynamic information and static information, which aims to realize effective segmentation of video and improve the accuracy of video semantic segmentation.
In order to achieve the above object, according to one aspect of the present invention, there is provided a video semantic segmentation method combining dynamic information with static information, comprising the steps of:
step 1, constructing a video semantic segmentation network architecture combining dynamic information and static information;
the video semantic segmentation network architecture is provided with 3 reference frames which are respectively used for processing a video frame at the current moment T, a video frame at the moment T-1 and a video frame at the moment T-2; each reference frame uses a time sequence feature encoder to extract features and outputs a feature map of the corresponding reference frame through a convolution layer; splicing the output characteristic diagram of the second reference system with the output characteristic diagram of the third reference system, sending the output characteristic diagram of the second reference system and the output characteristic diagram of the third reference system to a position learning module to learn position information to obtain a dynamic information characteristic diagram, adding the dynamic information characteristic diagram and the static information characteristic diagram obtained after the output characteristic diagram of the first reference system is learned by the position learning module to obtain a characteristic representation with dynamic information and static information, sending the characteristic representation with the dynamic information and the static information to the position learning module to learn, sending the characteristic representation with the dynamic information and the static information to a decoder to perform characteristic decoding, and finally obtaining the subscript of the maximum value of the category prediction of each corresponding pixel point to obtain a final prediction mask;
step 2, designing a loss function, and training on a data set to obtain a video semantic segmentation model;
and 3, using a video semantic segmentation model to realize intelligent segmentation of the video.
Further, the time sequence feature encoder is divided into four time sequence feature encoding layers, of which the first two layers are composed of time sequence feature residual blocks and the last two layers are composed of time sequence feature random discarding residual blocks;
the first layer and the second layer of time sequence feature coding layers are respectively composed of K1 time sequence feature residual blocks and K2 time sequence feature residual blocks, and the third layer and the fourth layer of time sequence feature coding layers are respectively composed of K3 time sequence feature random discarding residual blocks and K4 time sequence feature random discarding residual blocks;
the time sequence characteristic residual block consists of a convolution layer, a layer normalization layer, a depth convolution layer, an activation layer and a convolution layer, wherein the characteristic diagram of the input time sequence characteristic residual block sequentially passes through the layers, and then the characteristic diagram adding operation is carried out on the characteristic diagram of the input time sequence characteristic residual block and the characteristic diagram of the residual branch to output the characteristic diagram; the time sequence characteristic random discarding residual block consists of a convolution layer, a layer normalization layer, an activation layer, a convolution layer and a random discarding layer, wherein the characteristic diagram of the input time sequence characteristic random discarding residual block sequentially passes through the first four layers, then the characteristic diagram adding operation is carried out through residual branches and the characteristic diagram of the input time sequence characteristic random discarding residual block, and then the characteristic diagram is output after passing through a random discarding layer;
furthermore, the RELU activation function is used by the activation layer, and the Drop path operation is used by the random discarding layer.
Further, the first 5×5 convolution layer of the first time sequence feature residual block in each of the first two time sequence feature encoding layers of the time sequence feature encoder uses a stride of 2 to reduce the height and width of the feature map; in this case a 2×2 convolution layer is used on the residual branch of that block to reduce the height and width of the feature map so that the feature map sizes remain consistent for the addition, and the other time sequence feature residual blocks do not perform this operation. Likewise, the first 7×7 convolution layer of the first time sequence feature random discarding residual block in each of the last two time sequence feature encoding layers uses a stride of 2 to reduce the height and width of the feature map; a 2×2 convolution layer is used on the residual branch of that block to keep the feature map sizes consistent for the addition, and the other time sequence feature random discarding residual blocks do not perform this operation.
Further, the specific processing procedure of the position learning module is as follows:
after the feature map is input to the position learning module, it is split into three branches, each of which performs a feature map reshaping operation that merges the last two dimensions, turning a feature map of size C×H×W into C×(H×W); the first branch then applies a dimension transposition, swapping the first and second dimensions to obtain an (H×W)×C matrix, which is matrix-multiplied with the matrix on the third branch; across the two matrix multiplications, an (H×W)×(H×W) matrix is obtained first and then a C×(H×W) matrix, which is reshaped into a C×H×W tensor; finally, a 1×1 convolution is applied to obtain a 1×H×W map, which is added to the corresponding positions of the feature map that entered the position learning module to give the final output.
Further, the feature map output by the time sequence feature encoder of the first reference frame is passed through a 5×5 convolution to extract features, producing the feature map of the first reference frame; the feature map output by the time sequence feature encoder of the second reference frame is passed through a 7×7 convolution, producing the feature map of the second reference frame; and the feature map output by the time sequence feature encoder of the third reference frame is passed through an 11×11 convolution, producing the feature map of the third reference frame.
Further, the loss function designed in step 2 is a position-weighted loss function L_p composed of two losses, L_1 and L_2, whose specific formulas are as follows:
In the formulas for L_1 and L_2, C is the number of pixel classes, N is the number of pixels in the mask, y_ij denotes the ground-truth label of the i-th pixel for the j-th class, p_ij denotes the predicted probability of the j-th class for the i-th pixel, α_j sets different weights for different classes j, w_i is a position weight that assigns different weights to pixels at different positions, and ε is a small value used to avoid a zero denominator; L_1 and L_2 compose the position-weighted loss function L_p as follows:
where λ is a weight used to control the contribution of the latter part of the loss, and |1-L_2| denotes taking the absolute value of 1-L_2.
Further, the value of α_j is determined by the object to be segmented, and objects that are easy to segment are assigned smaller weights than other objects; the value of w_i is determined by the position of the pixel in the image, and pixels in the middle of the image are given larger position weights than pixels at the image edge.
In general, compared with the prior art, the above technical solutions conceived by the present invention achieve the following beneficial effects:
(1) By improving the network structure, deepening the depth of the network and adding a random discarding layer in the deep layer of the network, the network overfitting can be prevented, and the learning ability and generalization of the network can be improved.
(2) A loss function is designed to simultaneously focus on the prediction of pixel level and the prediction of object edge information.
(3) A position learning module is designed, the correlation of the positions in the feature map is learned through matrix multiplication and convolution, and position weights are given to the feature map, so that the sensitivity of the network to dynamic information and static information is improved, and the segmentation accuracy is improved.
Drawings
Fig. 1 is a flowchart of a technical scheme of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the time sequence feature residual block structure of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a sequential feature random discarding residual block structure of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a position learning module of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a network framework of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
Referring to fig. 1, which is a flowchart of the technical scheme of the video semantic segmentation method combining dynamic information and static information provided by an embodiment, the method specifically includes the following steps:
(1) Constructing a video semantic segmentation network architecture combining dynamic information and static information;
specifically, referring to fig. 5, fig. 5 is a schematic diagram of a network frame of a video semantic segmentation method based on dynamic information and static information combination according to an embodiment of the present invention.
First, the network is provided with 3 reference frames, which process the video frame at the current time T, the video frame at time T-1 and the video frame at time T-2, respectively. In special cases: if the video frame at the current time T is the first frame, the reference frames for times T-1 and T-2 both use the video frame at time T; if it is the second frame, the reference frames for times T-1 and T-2 both use the video frame at time T-1.
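As an illustration of this boundary handling, the following sketch (one possible implementation assumed for clarity, not code from the patent) selects the three reference frames for a frame index t:

```python
def select_reference_frames(frames, t):
    """Pick the frames fed to the three reference branches for time t.

    frames: a sequence of video frames ordered by time.
    For t == 0 all three branches see frame t; for t == 1 the T-1 and
    T-2 branches both see frame t-1, as described in the text.
    """
    cur = frames[t]
    prev1 = frames[t - 1] if t >= 1 else frames[t]
    prev2 = frames[t - 2] if t >= 2 else prev1
    return cur, prev1, prev2
```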
Second, each reference frame of the network uses a time sequence feature encoder to extract features, and the time sequence feature encoders of the 3 reference frames have identical structures. The feature map output by the time sequence feature encoder of the first reference frame is passed through a 5×5 convolution to extract features, producing the feature map of the first reference frame. The feature map output by the time sequence feature encoder of the second reference frame is passed through a 7×7 convolution, producing the feature map of the second reference frame. The feature map output by the time sequence feature encoder of the third reference frame is passed through an 11×11 convolution, producing the feature map of the third reference frame. Convolutions of different scales are used to integrate information from video frames at different times: the farther a frame is from the current time, the larger the convolution kernel, because the objects to be segmented differ more from their appearance at the current time and therefore a larger convolution kernel is needed for their feature representation.
Finally, the output feature map of the second reference frame is concatenated with the output feature map of the third reference frame, and the result is sent to a position learning module to learn position information, yielding a dynamic information feature map (for the position learning module, please refer to fig. 4, which is a schematic structural diagram of the position learning module of the video semantic segmentation method combining dynamic information and static information provided by an embodiment). The dynamic information feature map is added to the static information feature map obtained by passing the output feature map of the first reference frame through the position learning module, giving a feature representation that carries both dynamic and static information. This representation is sent to the position learning module for learning and then to a decoder for feature decoding, and finally the index of the maximum class prediction of each pixel is taken to obtain the final prediction mask.
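A minimal PyTorch sketch of this fusion pipeline is given below. The encoder, position learning module and decoder are passed in as placeholder modules; sharing one encoder and one position learning module across branches, and halving the channel counts of the T-1 and T-2 branches so that their concatenation matches the static branch, are assumptions made only to keep the shapes consistent, since the patent does not state them.

```python
import torch
import torch.nn as nn

class DynamicStaticSegNet(nn.Module):
    """Sketch of the three-branch fusion pipeline; shapes and weight sharing are assumptions."""
    def __init__(self, encoder, pos_module, decoder, channels=256):
        super().__init__()
        self.encoder = encoder                                            # time sequence feature encoder (assumed shared)
        self.conv_t  = nn.Conv2d(channels, channels, 5, padding=2)        # current frame T
        self.conv_t1 = nn.Conv2d(channels, channels // 2, 7, padding=3)   # frame T-1 (half channels, assumed)
        self.conv_t2 = nn.Conv2d(channels, channels // 2, 11, padding=5)  # frame T-2 (half channels, assumed)
        self.pos = pos_module                                             # position learning module
        self.decoder = decoder                                            # two-stage decoder -> per-class score maps

    def forward(self, frame_t, frame_t1, frame_t2):
        f_t  = self.conv_t(self.encoder(frame_t))
        f_t1 = self.conv_t1(self.encoder(frame_t1))
        f_t2 = self.conv_t2(self.encoder(frame_t2))
        dynamic = self.pos(torch.cat([f_t1, f_t2], dim=1))   # dynamic information feature map
        static  = self.pos(f_t)                               # static information feature map
        fused   = self.pos(dynamic + static)                  # combined dynamic + static representation
        scores  = self.decoder(fused)                         # (B, num_classes, H, W)
        return scores.argmax(dim=1)                           # per-pixel class index = prediction mask
```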
The decoder used here is a two-stage feature map decoding structure commonly used in the field of video segmentation.
Specifically, the time sequence feature encoder is composed of two kinds of residual blocks, namely the time sequence feature residual block and the time sequence feature random discarding residual block; please refer to fig. 2 and fig. 3, which are schematic diagrams of the time sequence feature residual block structure and the time sequence feature random discarding residual block structure, respectively, of the video semantic segmentation method combining dynamic information and static information provided by an embodiment. The time sequence feature encoder can be divided into four time sequence feature encoding layers, of which the first two layers are composed of time sequence feature residual blocks and the last two layers are composed of time sequence feature random discarding residual blocks.
The first and second time sequence feature encoding layers are composed of 4 and 6 time sequence feature residual blocks, respectively, and the third and fourth time sequence feature encoding layers are composed of 9 and 15 time sequence feature random discarding residual blocks, respectively; these values are the best parameters determined through experiments.
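This stage configuration can be summarised in a short construction sketch. `block` and `drop_block` stand for the two residual block types detailed further below; the channel widths are placeholders, not values given in the patent.

```python
import torch.nn as nn

# Block counts per encoding layer as stated above: K1=4, K2=6, K3=9, K4=15.
BLOCKS_PER_STAGE = (4, 6, 9, 15)

def build_encoder(block, drop_block, in_ch=3, widths=(64, 128, 256, 512)):
    """Assemble the four time sequence feature encoding layers.

    `block` / `drop_block` are constructors for the two residual block types
    (sketched further below); `widths` are assumed channel counts.
    """
    stages, prev = [], in_ch
    for i, (n, w) in enumerate(zip(BLOCKS_PER_STAGE, widths)):
        make = block if i < 2 else drop_block          # first two layers use plain residual blocks
        blocks = [make(prev, w, downsample=True)]      # first block of each layer downsamples (stride 2)
        blocks += [make(w, w, downsample=False) for _ in range(n - 1)]
        stages.append(nn.Sequential(*blocks))
        prev = w
    return nn.Sequential(*stages)
```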
Specifically, the position learning module operates as follows: after the feature map is input to the position learning module, it is split into three branches, each of which performs a feature map reshaping operation that merges the last two dimensions, turning a feature map of size C×H×W into C×(H×W). The first branch then applies a dimension transposition, swapping the first and second dimensions to obtain an (H×W)×C matrix, which is matrix-multiplied with the matrix on the third branch; across the two matrix multiplications, an (H×W)×(H×W) matrix is obtained first and then a C×(H×W) matrix, which is reshaped into a C×H×W tensor. Finally, a 1×1 convolution is applied to obtain a 1×H×W map, which is added to the corresponding positions of the feature map that entered the position learning module to give the final output.
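A minimal PyTorch rendering of this procedure is sketched below. It follows the matrix shapes stated above; the absence of normalisation on the (H×W)×(H×W) similarity matrix simply mirrors the text, and broadcasting the 1×H×W map over all channels during the final addition is an assumption about how "added to the corresponding position" is meant.

```python
import torch
import torch.nn as nn

class PositionLearningModule(nn.Module):
    """Position learning module sketch following the matrix shapes described above."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, 1, kernel_size=1)   # final 1x1 conv -> 1 x H x W position map

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        f = x.flatten(2)                         # all three branches share this C x (H*W) reshape
        sim = torch.bmm(f.transpose(1, 2), f)    # (H*W) x C  @  C x (H*W)  ->  (H*W) x (H*W)
        out = torch.bmm(f, sim)                  # C x (H*W)  @  (H*W) x (H*W)  ->  C x (H*W)
        out = out.view(b, c, h, w)               # reshape back to C x H x W
        pos = self.proj(out)                     # 1 x H x W position weights
        return x + pos                           # broadcast-add to the input feature map
```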
Specifically, the time sequence feature residual block and the time sequence feature random discarding residual block are constituted as follows. The time sequence feature residual block consists of a 5×5 convolution layer, a layer normalization layer, a 3×3 depthwise convolution layer, an activation layer and a 1×1 convolution layer; the feature map input to the block passes through these layers in sequence and is then added to the feature map of the residual branch, and the sum is output. The time sequence feature random discarding residual block consists of a 7×7 convolution layer, a layer normalization layer, an activation layer, a 1×1 convolution layer and a random discarding layer; the feature map input to the block passes through the first four layers in sequence, is added to the feature map carried by the residual branch, and the result is output after passing through the random discarding layer. The activation layer uses the ReLU activation function, and the random discarding layer uses the DropPath operation.
Specifically, the two residual blocks in the time sequence feature encoder are arranged as follows. The first 5×5 convolution layer of the first time sequence feature residual block in each of the first two time sequence feature encoding layers uses a stride of 2 to reduce the height and width of the feature map; in this case a 2×2 convolution layer is used on the residual branch of that block to reduce the height and width of the feature map so that the feature map sizes remain consistent for the addition, and the other time sequence feature residual blocks do not perform this operation. The reason is that setting the stride of these convolutions to 2 reduces the size of the feature map, while the other blocks only need to learn features and do not need to reduce the feature map size. Likewise, the first 7×7 convolution layer of the first time sequence feature random discarding residual block in each of the last two time sequence feature encoding layers uses a stride of 2 to reduce the height and width of the feature map; a 2×2 convolution layer is used on the residual branch of that block to keep the feature map sizes consistent for the addition, and the other time sequence feature random discarding residual blocks do not perform this operation.
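The two block types can be sketched as follows. Kernel sizes, the layer order and the stride-2 / 2×2 shortcut handling follow the description above; applying DropPath to the residual branch before the addition is an assumption (the text places the random discarding layer after the addition), the drop probability is an assumed value, and the channel-last LayerNorm wrapper is an implementation detail not taken from the patent.

```python
import torch
import torch.nn as nn

def drop_path(x, p, training):
    """DropPath / stochastic depth: randomly zero whole samples of a residual branch."""
    if p == 0.0 or not training:
        return x
    keep = 1.0 - p
    mask = (torch.rand(x.shape[0], 1, 1, 1, device=x.device) < keep).float()
    return x * mask / keep

class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel dimension of a (B, C, H, W) tensor (implementation detail)."""
    def __init__(self, c):
        super().__init__()
        self.norm = nn.LayerNorm(c)
    def forward(self, x):
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

class TimeSeqResBlock(nn.Module):
    """5x5 conv -> LayerNorm -> 3x3 depthwise conv -> ReLU -> 1x1 conv, plus shortcut."""
    def __init__(self, in_ch, out_ch, downsample=False):
        super().__init__()
        stride = 2 if downsample else 1
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 5, stride=stride, padding=2),
            ChannelLayerNorm(out_ch),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch),   # depthwise convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),
        )
        # 2x2 stride-2 conv on the shortcut only in the downsampling block; the other blocks
        # keep their channel count, so an identity shortcut suffices. Even H/W is assumed so
        # that both paths shrink to the same size.
        self.shortcut = nn.Conv2d(in_ch, out_ch, 2, stride=2) if downsample else nn.Identity()
    def forward(self, x):
        return self.body(x) + self.shortcut(x)

class TimeSeqDropResBlock(nn.Module):
    """7x7 conv -> LayerNorm -> ReLU -> 1x1 conv with DropPath, plus shortcut."""
    def __init__(self, in_ch, out_ch, downsample=False, p=0.1):   # p=0.1 is an assumed rate
        super().__init__()
        stride = 2 if downsample else 1
        self.p = p
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 7, stride=stride, padding=3),
            ChannelLayerNorm(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, 2, stride=2) if downsample else nn.Identity()
    def forward(self, x):
        # DropPath applied to the residual branch before the addition (common practice);
        # the patent text describes the random discarding layer after the addition.
        return drop_path(self.body(x), self.p, self.training) + self.shortcut(x)
```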
(2) Designing a loss function and training on an urban landscape dataset to obtain the video semantic segmentation model. The urban landscape dataset has 19 classes; the label of a picture is a single-channel image whose pixel values lie in the range 0 to 18, with each class corresponding to one pixel value, so that classification is realized at the pixel level; such a label is commonly called a mask image.
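For reference, a small sketch (an illustration, not code from the patent) of how such a single-channel mask relates to the per-pixel, per-class targets y_ij used in the loss below:

```python
import torch
import torch.nn.functional as F

num_classes = 19
mask = torch.randint(0, num_classes, (512, 1024))   # single-channel label image, values 0..18
y = F.one_hot(mask, num_classes)                     # (H, W, C): y[i, j, c] = 1 for the true class
print(y.shape, y.sum(dim=-1).unique())               # every pixel belongs to exactly one class
```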
specifically, the loss function designed in the step 2 is a position weighted loss number L p Loss L from two parts 1 And L 2 Composition, L 1 And L 2 The specific formula is as follows:
In the formulas for L_1 and L_2, C is the number of pixel classes, N is the number of pixels in the mask, y_ij denotes the ground-truth label of the i-th pixel for the j-th class, and p_ij denotes the predicted probability of the j-th class for the i-th pixel. α_j sets different weights for different classes j: objects that are easy to segment, such as the background and persons, are assigned lower weights, while the other classes are assigned larger weights according to experimental results, the ratio of the two weights being 9:10. w_i is a position weight that assigns different weights to pixels at different positions; the position weight for the middle of the image is larger than that for the image edge, 1.1 and 1 respectively. ε is a small value, usually set to 0.0004, used to avoid a zero denominator. L_1 and L_2 compose the position-weighted loss function L_p as follows:
where λ is a weight used to control the contribution of the latter part of the loss and is typically set to 0.8. |1-L_2| denotes taking the absolute value of 1-L_2; this part of the loss is expressed through L_2, and by keeping 1-L_2 as small as possible, network training is focused more on the accuracy of boundary pixel segmentation. Combining the two terms into the position-weighted loss function L_p allows training to attend both to the overall segmentation and to the segmentation of edge information.
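The formulas for L_1 and L_2 are not reproduced in this text. For readability, one plausible reconstruction consistent with the symbol definitions above is sketched below: L_1 is written as a class- and position-weighted cross-entropy and L_2 as a Dice-style overlap term; these two forms are assumptions, while the combination of L_1 and |1-L_2| follows directly from the description.

```latex
% L_1 and L_2 below are assumed forms consistent with the symbol definitions above;
% only the combination L_p is stated explicitly in the text.
\begin{align}
L_1 &= -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} \alpha_j\, w_i\, y_{ij}\,\log p_{ij}
  && \text{(class- and position-weighted cross-entropy, assumed)}\\
L_2 &= \frac{2\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\, p_{ij} + \varepsilon}
            {\sum_{i=1}^{N}\sum_{j=1}^{C} \left(y_{ij} + p_{ij}\right) + \varepsilon}
  && \text{(Dice-style overlap term, assumed)}\\
L_p &= L_1 + \lambda\,\lvert 1 - L_2 \rvert
  && \text{(combination as described, } \lambda \approx 0.8\text{)}
\end{align}
```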
(3) And using a video semantic segmentation model to realize intelligent segmentation of the video.
The invention provides a video semantic segmentation method combining dynamic information and static information. By improving the network structure and designing a dedicated loss function, efficient segmentation of videos is realized, the need to perform video segmentation manually is removed, and a high-accuracy video segmentation network construction strategy is provided. Compared with existing advanced video semantic segmentation methods, the mean intersection-over-union on the urban landscape dataset is improved by 0.8%.
Various modifications and alterations to this application may be made by those skilled in the art without departing from the spirit and scope of this application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.
Claims (8)
1. The video semantic segmentation method combining dynamic information and static information is characterized by comprising the following steps of:
step 1, constructing a video semantic segmentation network architecture combining dynamic information and static information;
the video semantic segmentation network architecture is provided with 3 reference frames which are respectively used for processing a video frame at the current moment T, a video frame at the moment T-1 and a video frame at the moment T-2; each reference frame uses a time sequence feature encoder to extract features and outputs a feature map of the corresponding reference frame through a convolution layer; splicing the output characteristic diagram of the second reference system with the output characteristic diagram of the third reference system, sending the obtained result to a position learning module to learn position information to obtain a dynamic information characteristic diagram, adding the dynamic information characteristic diagram with the static information characteristic diagram obtained after the output characteristic diagram of the first reference system is learned by the position learning module to obtain a characteristic representation with dynamic information and static information, sending the characteristic representation with the dynamic information and the static information to the position learning module to learn, sending the obtained characteristic representation with the dynamic information and the static information to a decoder to perform characteristic decoding, and finally obtaining the subscript of the maximum value of the prediction of each corresponding pixel point category to obtain a final prediction mask;
step 2, designing a loss function, and training on a data set to obtain a video semantic segmentation model;
and 3, using a video semantic segmentation model to realize intelligent segmentation of the video.
2. The video semantic segmentation method combining dynamic information and static information according to claim 1, wherein: the time sequence feature encoder is divided into four time sequence feature encoding layers, of which the first two layers are composed of time sequence feature residual blocks and the last two layers are composed of time sequence feature random discarding residual blocks;
the first layer and the second layer of time sequence feature coding layers are respectively composed of K1 time sequence feature residual blocks and K2 time sequence feature residual blocks, and the third layer and the fourth layer of time sequence feature coding layers are respectively composed of K3 time sequence feature random discarding residual blocks and K4 time sequence feature random discarding residual blocks;
the time sequence characteristic residual block consists of a convolution layer, a layer normalization layer, a depth convolution layer, an activation layer and a convolution layer, wherein the characteristic diagram of the input time sequence characteristic residual block sequentially passes through the layers, and then the characteristic diagram adding operation is carried out on the characteristic diagram of the input time sequence characteristic residual block and the characteristic diagram of the residual branch to output the characteristic diagram; the time sequence characteristic random discarding residual block consists of a convolution layer, a layer normalization layer, an activation layer, a convolution layer and a random discarding layer, wherein the characteristic diagram of the input time sequence characteristic random discarding residual block sequentially passes through the first four layers, then the characteristic diagram adding operation is carried out through the residual branch and the characteristic diagram of the input time sequence characteristic random discarding residual block, and then the characteristic diagram is output after passing through a random discarding layer.
3. The video semantic segmentation method combining dynamic information and static information according to claim 2, wherein: the activation layer uses the ReLU activation function, and the random discarding layer uses the DropPath operation.
4. The video semantic segmentation method combining dynamic information and static information according to claim 2, wherein: the first 5×5 convolution layer of the first time sequence feature residual block in each of the first two time sequence feature encoding layers of the time sequence feature encoder uses a stride of 2 to reduce the height and width of the feature map; in this case a 2×2 convolution layer is used on the residual branch of that block to reduce the height and width of the feature map so that the feature map sizes remain consistent for the addition, and the other time sequence feature residual blocks do not perform this operation; likewise, the first 7×7 convolution layer of the first time sequence feature random discarding residual block in each of the last two time sequence feature encoding layers uses a stride of 2 to reduce the height and width of the feature map, a 2×2 convolution layer is used on the residual branch of that block to keep the feature map sizes consistent for the addition, and the other time sequence feature random discarding residual blocks do not perform this operation.
5. The video semantic segmentation method combining dynamic information and static information according to claim 1, wherein the specific processing procedure of the position learning module is as follows:
after the feature map is input to the position learning module, it is split into three branches, each of which performs a feature map reshaping operation that merges the last two dimensions, turning a feature map of size C×H×W into C×(H×W); the first branch then applies a dimension transposition, swapping the first and second dimensions to obtain an (H×W)×C matrix, which is matrix-multiplied with the matrix on the third branch; across the two matrix multiplications, an (H×W)×(H×W) matrix is obtained first and then a C×(H×W) matrix, which is reshaped into a C×H×W tensor; finally, a 1×1 convolution is applied to obtain a 1×H×W map, which is added to the corresponding positions of the feature map that entered the position learning module to give the final output.
6. The video semantic segmentation method combining dynamic information and static information according to claim 1, wherein: the feature map output by the time sequence feature encoder of the first reference frame is passed through a 5×5 convolution to extract features, producing the feature map of the first reference frame; the feature map output by the time sequence feature encoder of the second reference frame is passed through a 7×7 convolution, producing the feature map of the second reference frame; and the feature map output by the time sequence feature encoder of the third reference frame is passed through an 11×11 convolution, producing the feature map of the third reference frame.
7. The video semantic segmentation method combining dynamic information and static information according to claim 1, wherein: the loss function designed in step 2 is a position-weighted loss function L_p composed of two losses, L_1 and L_2, whose specific formulas are as follows:
in the formulas for L_1 and L_2, C is the number of pixel classes, N is the number of pixels in the mask, y_ij denotes the ground-truth label of the i-th pixel for the j-th class, p_ij denotes the predicted probability of the j-th class for the i-th pixel, α_j sets different weights for different classes j, w_i is a position weight that assigns different weights to pixels at different positions, and ε is a small value used to avoid a zero denominator; L_1 and L_2 compose the position-weighted loss function L_p as follows:
where λ is a weight used to control the contribution of the latter part of the loss, and |1-L_2| denotes taking the absolute value of 1-L_2.
8. The video semantic segmentation method combining dynamic information and static information according to claim 7, wherein: the value of α_j is determined by the object to be segmented, and objects that are easy to segment are assigned smaller weights than other objects; the value of w_i is determined by the position of the pixel in the image, and pixels in the middle of the image are given larger position weights than pixels at the image edge.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310536770.7A CN116246075B (en) | 2023-05-12 | 2023-05-12 | Video semantic segmentation method combining dynamic information and static information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310536770.7A CN116246075B (en) | 2023-05-12 | 2023-05-12 | Video semantic segmentation method combining dynamic information and static information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116246075A true CN116246075A (en) | 2023-06-09 |
CN116246075B CN116246075B (en) | 2023-07-21 |
Family
ID=86633542
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310536770.7A Active CN116246075B (en) | 2023-05-12 | 2023-05-12 | Video semantic segmentation method combining dynamic information and static information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116246075B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200092552A1 (en) * | 2018-09-18 | 2020-03-19 | Google Llc | Receptive-Field-Conforming Convolutional Models for Video Coding |
CN111050219A (en) * | 2018-10-12 | 2020-04-21 | 奥多比公司 | Spatio-temporal memory network for locating target objects in video content |
CN111062395A (en) * | 2019-11-27 | 2020-04-24 | 北京理工大学 | Real-time video semantic segmentation method |
CN111652899A (en) * | 2020-05-29 | 2020-09-11 | 中国矿业大学 | Video target segmentation method of space-time component diagram |
CN113570610A (en) * | 2021-07-26 | 2021-10-29 | 北京百度网讯科技有限公司 | Method and device for performing target segmentation on video by adopting semantic segmentation model |
CN114596520A (en) * | 2022-02-09 | 2022-06-07 | 天津大学 | First visual angle video action identification method and device |
CN114663460A (en) * | 2022-02-28 | 2022-06-24 | 华南农业大学 | Video segmentation method and device based on double-current driving encoder and feature memory module |
CN114973071A (en) * | 2022-05-11 | 2022-08-30 | 中国科学院软件研究所 | Unsupervised video target segmentation method and system based on long-term and short-term time sequence characteristics |
US20230035475A1 (en) * | 2021-07-16 | 2023-02-02 | Huawei Technologies Co., Ltd. | Methods and systems for semantic segmentation of a point cloud |
-
2023
- 2023-05-12 CN CN202310536770.7A patent/CN116246075B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200092552A1 (en) * | 2018-09-18 | 2020-03-19 | Google Llc | Receptive-Field-Conforming Convolutional Models for Video Coding |
CN111050219A (en) * | 2018-10-12 | 2020-04-21 | 奥多比公司 | Spatio-temporal memory network for locating target objects in video content |
CN111062395A (en) * | 2019-11-27 | 2020-04-24 | 北京理工大学 | Real-time video semantic segmentation method |
CN111652899A (en) * | 2020-05-29 | 2020-09-11 | 中国矿业大学 | Video target segmentation method of space-time component diagram |
US20230035475A1 (en) * | 2021-07-16 | 2023-02-02 | Huawei Technologies Co., Ltd. | Methods and systems for semantic segmentation of a point cloud |
CN113570610A (en) * | 2021-07-26 | 2021-10-29 | 北京百度网讯科技有限公司 | Method and device for performing target segmentation on video by adopting semantic segmentation model |
CN114596520A (en) * | 2022-02-09 | 2022-06-07 | 天津大学 | First visual angle video action identification method and device |
CN114663460A (en) * | 2022-02-28 | 2022-06-24 | 华南农业大学 | Video segmentation method and device based on double-current driving encoder and feature memory module |
CN114973071A (en) * | 2022-05-11 | 2022-08-30 | 中国科学院软件研究所 | Unsupervised video target segmentation method and system based on long-term and short-term time sequence characteristics |
Non-Patent Citations (5)
Title |
---|
SIYUE YU: "Fast pixel-matching for video object segmentation", Signal Processing: Image Communication, vol. 98, pages 3 - 5 *
YU Feng: "Pedestrian trajectory prediction based on multi-head soft attention graph convolutional network", Journal of Computer Applications, vol. 43, no. 03, pages 736 - 743 *
YU Feng: "Research on virtual try-on algorithms for multi-pose transfer", Journal of Wuhan Textile University, vol. 35, no. 01, pages 3 - 9 *
JING Zhuangwei; GUAN Haiyan; PENG Daifeng; YU Yongtao: "Survey of image semantic segmentation based on deep neural networks", Computer Engineering, vol. 46, no. 10, pages 1 - 17 *
WANG Tingting: "Research on intelligent pedestrian detection methods for road traffic scenes", China Master's Theses Full-text Database (Engineering Science and Technology II), no. 02, pages 034 - 1395 *
Also Published As
Publication number | Publication date |
---|---|
CN116246075B (en) | 2023-07-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113052210B (en) | Rapid low-light target detection method based on convolutional neural network | |
CN106960206B (en) | Character recognition method and character recognition system | |
CN111368636B (en) | Object classification method, device, computer equipment and storage medium | |
CN112801027B (en) | Vehicle target detection method based on event camera | |
CN112634296A (en) | RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism | |
CN111738169A (en) | Handwriting formula recognition method based on end-to-end network model | |
CN112084859A (en) | Building segmentation method based on dense boundary block and attention mechanism | |
CN117197763A (en) | Road crack detection method and system based on cross attention guide feature alignment network | |
CN113901924A (en) | Document table detection method and device | |
CN113436198A (en) | Remote sensing image semantic segmentation method for collaborative image super-resolution reconstruction | |
CN117975418A (en) | Traffic sign detection method based on improved RT-DETR | |
CN111429468B (en) | Cell nucleus segmentation method, device, equipment and storage medium | |
CN116246075B (en) | Video semantic segmentation method combining dynamic information and static information | |
CN118230323A (en) | Semantic segmentation method for fusing space detail context and multi-scale interactive image | |
CN116597339A (en) | Video target segmentation method based on mask guide semi-dense contrast learning | |
CN114782995A (en) | Human interaction behavior detection method based on self-attention mechanism | |
CN115205518A (en) | Target detection method and system based on YOLO v5s network structure | |
Wang et al. | Research on Semantic Segmentation Algorithm for Multiscale Feature Images Based on Improved DeepLab v3+ | |
CN114758279B (en) | Video target detection method based on time domain information transfer | |
CN118155219A (en) | Text detection and recognition method for terminal strip coding tube based on deep learning | |
CN118965058A (en) | Language expression-based arbitrary category counting model and counting method thereof | |
CN118968051A (en) | Urban scene real-time semantic segmentation method based on gating alignment network | |
CN118918326A (en) | Urban scene real-time semantic segmentation method based on bilateral fusion network | |
CN115661676A (en) | Building segmentation system and method based on serial attention module and parallel attention module | |
CN118038042A (en) | Infrared image segmentation method and device based on visible light image enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |