
CN116246075A - Video semantic segmentation method combining dynamic information and static information - Google Patents

Video semantic segmentation method combining dynamic information and static information

Info

Publication number
CN116246075A
Authority
CN
China
Prior art keywords
time sequence
feature
characteristic
layer
multiplied
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310536770.7A
Other languages
Chinese (zh)
Other versions
CN116246075B (en)
Inventor
余锋
李会引
姜明华
汤光裕
刘莉
周昌龙
宋坤芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Textile University
Original Assignee
Wuhan Textile University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Textile University filed Critical Wuhan Textile University
Priority to CN202310536770.7A priority Critical patent/CN116246075B/en
Publication of CN116246075A publication Critical patent/CN116246075A/en
Application granted granted Critical
Publication of CN116246075B publication Critical patent/CN116246075B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Algebra (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a video semantic segmentation method combining dynamic information and static information, which comprises the following steps: first, a video semantic segmentation network fusing dynamic and static information is constructed; then a loss function is designed and a video semantic segmentation model is obtained by training on a video semantic segmentation data set; finally, the model is used to realize intelligent segmentation of video. By improving the video semantic segmentation model and the loss function, the invention improves the mean intersection-over-union of video segmentation, provides a high-precision video semantic segmentation network construction strategy, offers a reference for making video segmentation intelligent, and greatly saves labor cost.

Description

Video semantic segmentation method combining dynamic information and static information
Technical Field
The invention relates to the field of video semantic segmentation, and in particular relates to a video semantic segmentation method combining dynamic information and static information.
Background
With the rapid growth in the number of videos, analyzing and understanding video content has become increasingly important. Video semantic segmentation is an important step in content understanding, and improving its segmentation accuracy is an urgent problem to be solved.
The Chinese patent with publication number CN113139502A discloses a video semantic segmentation method, apparatus, electronic device and storage medium, which proposes improving image segmentation accuracy through multi-modal picture information. This may be sufficient for simple classification, but when extended to multi-class segmentation using only multi-modal images, the segmentation accuracy is far from sufficient.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the invention provides a video semantic segmentation method combining dynamic information and static information, which aims to realize effective segmentation of video and improve the accuracy of video semantic segmentation.
In order to achieve the above object, according to one aspect of the present invention, there is provided a video semantic segmentation method combining dynamic information with static information, comprising the steps of:
step 1, constructing a video semantic segmentation network architecture combining dynamic information and static information;
the video semantic segmentation network architecture is provided with 3 reference frames which are respectively used for processing a video frame at the current moment T, a video frame at the moment T-1 and a video frame at the moment T-2; each reference frame uses a time sequence feature encoder to extract features and outputs a feature map of the corresponding reference frame through a convolution layer; splicing the output characteristic diagram of the second reference system with the output characteristic diagram of the third reference system, sending the output characteristic diagram of the second reference system and the output characteristic diagram of the third reference system to a position learning module to learn position information to obtain a dynamic information characteristic diagram, adding the dynamic information characteristic diagram and the static information characteristic diagram obtained after the output characteristic diagram of the first reference system is learned by the position learning module to obtain a characteristic representation with dynamic information and static information, sending the characteristic representation with the dynamic information and the static information to the position learning module to learn, sending the characteristic representation with the dynamic information and the static information to a decoder to perform characteristic decoding, and finally obtaining the subscript of the maximum value of the category prediction of each corresponding pixel point to obtain a final prediction mask;
step 2, designing a loss function, and training on a data set to obtain a video semantic segmentation model;
and 3, using a video semantic segmentation model to realize intelligent segmentation of the video.
Further, the time sequence feature encoder is divided into four time sequence feature encoding layers, of which the first two layers are composed of time sequence feature residual blocks and the last two layers are composed of time sequence feature random discarding residual blocks;
the first and second time sequence feature encoding layers are composed of K1 and K2 time sequence feature residual blocks respectively, and the third and fourth time sequence feature encoding layers are composed of K3 and K4 time sequence feature random discarding residual blocks respectively;
the time sequence feature residual block consists of a convolution layer, a layer normalization layer, a depthwise convolution layer, an activation layer and a convolution layer; the feature map input to the block passes through these layers in sequence and is then added to the input feature map through the residual branch to output the feature map. The time sequence feature random discarding residual block consists of a convolution layer, a layer normalization layer, an activation layer, a convolution layer and a random discarding layer; the input feature map passes through the first four layers in sequence, is added to the input feature map through the residual branch, and is then output after passing through the random discarding layer.
furthermore, the RELU activation function is used by the activation layer, and the Drop path operation is used by the random discarding layer.
Further, the first 5×5 convolution layer of the first time sequence feature residual block in each of the first two time sequence feature encoding layers of the time sequence feature encoder sets its stride to 2 to reduce the height and width of the feature map; in this case a 2×2 convolution layer is used on the residual branch of the time sequence feature residual block to reduce the height and width of the feature map so that the sizes remain consistent when the feature maps are added, and the other time sequence feature residual blocks do not perform this operation. Likewise, the first 7×7 convolution layer of the first time sequence feature random discarding residual block in each of the last two time sequence feature encoding layers sets its stride to 2 to reduce the height and width of the feature map; a 2×2 convolution layer is used on the residual branch to keep the sizes consistent when the feature maps are added, and the other time sequence feature random discarding residual blocks do not perform this operation.
Further, the specific processing procedure of the position learning module is as follows:
after the feature map is input to the position learning module, it is split into three branches, each of which undergoes a feature map reshaping operation in which the feature map of size C×H×W has its last two dimensions merged, becoming C×(H×W). The first branch then swaps its first and second dimensions, giving an (H×W)×C matrix. This matrix is matrix-multiplied with the matrices on the other two branches: the two operations first produce an (H×W)×(H×W) matrix and then a C×(H×W) matrix, which is reshaped into a C×H×W tensor. Finally, a 1×1 convolution produces a 1×H×W map, which is added to the corresponding positions of the feature map originally input to the position learning module to obtain the final output result.
Further, the feature map output by the time sequence feature encoder of the first reference frame passes through a 5×5 convolution to extract features, and the feature map of the first reference frame is output; the feature map output by the time sequence feature encoder of the second reference frame passes through a 7×7 convolution to extract features, and the feature map of the second reference frame is output; the feature map output by the time sequence feature encoder of the third reference frame passes through an 11×11 convolution to extract features, and the feature map of the third reference frame is output.
Further, the loss function designed in step 2 is a position-weighted loss function L_p composed of two parts, a loss L_1 and a loss L_2, whose specific formulas are as follows:
L_1 = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} \alpha_j \, w_i \, y_{ij} \log\left(p_{ij}\right)

L_2 = \frac{1}{C}\sum_{j=1}^{C} \frac{2\sum_{i=1}^{N} y_{ij}\, p_{ij} + \varepsilon}{\sum_{i=1}^{N} y_{ij} + \sum_{i=1}^{N} p_{ij} + \varepsilon}
In the formulas for L_1 and L_2, C is the number of pixel classes, N is the number of pixels in the mask, y_ij denotes the true label of the i-th pixel for the j-th class, p_ij denotes the predicted probability of the j-th class for the i-th pixel, α_j assigns different weights to different classes j, w_i is a position weight that assigns different weights to pixels at different positions, and ε is a small value used to avoid a zero denominator. L_1 and L_2 form the position-weighted loss function L_p as follows:
L_p = L_1 + \lambda \left| 1 - L_2 \right|
where λ is a loss weight used to control the contribution of the latter part, and |1 - L_2| denotes taking the absolute value of 1 - L_2.
Further, the value of α_j is determined by the object to be segmented, and objects that are easy to segment are assigned smaller weights than other objects; the value of w_i is determined by the position of the pixel in the image, and pixels in the middle of the image receive a larger position weight than pixels at the image edge.
In general, compared with the prior art, the above technical solutions conceived by the present invention achieve the following beneficial effects:
(1) By improving the network structure, deepening the network and adding random discarding layers in the deep layers of the network, overfitting is prevented and the learning ability and generalization of the network are improved.
(2) A loss function is designed that simultaneously attends to pixel-level prediction and to the prediction of object edge information.
(3) A position learning module is designed, which learns the correlation between positions in the feature map through matrix multiplication and convolution and assigns position weights to the feature map, thereby improving the network's sensitivity to dynamic and static information and the segmentation accuracy.
Drawings
Fig. 1 is a flowchart of a technical scheme of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the time sequence feature residual block structure of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the time sequence feature random discarding residual block structure of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a position learning module of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a network framework of a video semantic segmentation method combining dynamic information and static information according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and embodiments, in order to make the objects, technical solutions and advantages of the present invention clearer. It should be understood that the specific embodiments described here are only intended to explain the invention and are not intended to limit its scope. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Referring to fig. 1, fig. 1 is a flowchart of a technical scheme of a video semantic segmentation method combining dynamic information and static information, which is provided by an embodiment, specifically includes the following steps:
(1) Constructing a video semantic segmentation network architecture combining dynamic information and static information;
specifically, referring to fig. 5, fig. 5 is a schematic diagram of a network frame of a video semantic segmentation method based on dynamic information and static information combination according to an embodiment of the present invention.
First, the network is provided with 3 reference frames, which respectively process the video frame at the current time T, the video frame at time T-1 and the video frame at time T-2. As special cases, if the video frame at the current time T is the first frame of the video, the reference frames at times T-1 and T-2 both use the video frame at time T; if the video frame at the current time T is the second frame, the reference frames at times T-1 and T-2 both use the video frame at time T-1.
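A minimal sketch of this fallback rule for the reference frames, assuming `frames` is a zero-indexed list of video frames (the helper name is hypothetical):

```python
def select_reference_frames(frames, t):
    """Return the frames fed to the three reference branches at time t.

    Falls back to the current (or previous) frame when t is the first or
    second frame of the video, as described above.
    """
    if t == 0:                                  # first frame: no history available
        return frames[t], frames[t], frames[t]
    if t == 1:                                  # second frame: only one earlier frame exists
        return frames[t], frames[t - 1], frames[t - 1]
    return frames[t], frames[t - 1], frames[t - 2]
```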
Second, each reference frame of the network uses a time sequence feature encoder to extract features, and the encoders of the 3 reference frames have identical structure. The feature map output by the time sequence feature encoder of the first reference frame passes through a 5×5 convolution to extract features, and the feature map of the first reference frame is output; the feature map output by the encoder of the second reference frame passes through a 7×7 convolution, and the feature map of the second reference frame is output; the feature map output by the encoder of the third reference frame passes through an 11×11 convolution, and the feature map of the third reference frame is output. Convolutions of different scales are used to integrate information from video frames at different times: the farther a frame is in time, the larger the convolution kernel used, because the objects to be segmented differ more from the current moment and a larger convolution kernel is needed for the feature representation.
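As a sketch of the per-branch output convolutions just described; the channel width C and the padding that keeps the spatial size unchanged are assumptions, since the patent does not state them:

```python
import torch.nn as nn

C = 256  # hypothetical channel width of the encoder output

# Per-branch output convolutions: the farther the frame is in time, the larger the kernel.
conv_t  = nn.Conv2d(C, C, kernel_size=5,  padding=2)   # current frame T
conv_t1 = nn.Conv2d(C, C, kernel_size=7,  padding=3)   # frame at time T-1
conv_t2 = nn.Conv2d(C, C, kernel_size=11, padding=5)   # frame at time T-2
```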
Finally, the output feature map of the second reference frame is spliced with that of the third reference frame, and the result is sent to a position learning module to learn position information, yielding a dynamic information feature map (for the position learning module, see fig. 4, a schematic diagram of the position learning module structure of the video semantic segmentation method combining dynamic information and static information provided by an embodiment). The dynamic information feature map is added to the static information feature map obtained after the output feature map of the first reference frame is learned by the position learning module, giving a feature representation with both dynamic and static information. This feature representation is sent to the position learning module for learning and then to a decoder for feature decoding; finally, the index of the maximum value of the category prediction for each pixel is taken to obtain the final prediction mask.
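Continuing the sketch, the fusion path of this paragraph could look as follows. `pos_learn` is an instance of the position learning module (sketched further below), `decoder` stands for the two-stage decoder mentioned in the next paragraph, and the 1×1 `reduce` convolution that restores the channel count after concatenation is an assumption, since the patent does not say how the doubled channels are handled.

```python
import torch
import torch.nn as nn

reduce = nn.Conv2d(2 * C, C, kernel_size=1)   # assumed: bring the concatenated map back to C channels

def fuse_and_decode(f_t, f_t1, f_t2, pos_learn, decoder):
    """f_t, f_t1, f_t2: per-branch feature maps from the convolutions above (B x C x H x W)."""
    dynamic = pos_learn(reduce(torch.cat([f_t1, f_t2], dim=1)))   # dynamic information feature map
    static = pos_learn(f_t)                                       # static information feature map
    fused = pos_learn(dynamic + static)                           # dynamic + static representation
    logits = decoder(fused)                                       # B x num_classes x H x W predictions
    return logits.argmax(dim=1)                                   # per-pixel index of the maximum = prediction mask
```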
The decoder used here is a two-stage feature map decoding structure commonly used in the field of video segmentation.
Specifically, the time sequence feature encoder is composed of two kinds of residual blocks: time sequence feature residual blocks and time sequence feature random discarding residual blocks; see fig. 2 and fig. 3, which are schematic diagrams of the time sequence feature residual block structure and the time sequence feature random discarding residual block structure, respectively, of the video semantic segmentation method combining dynamic information and static information provided by an embodiment. The time sequence feature encoder can be divided into four time sequence feature encoding layers, of which the first two layers are composed of time sequence feature residual blocks and the last two layers are composed of time sequence feature random discarding residual blocks.
The first and second time sequence feature encoding layers are composed of 4 and 6 time sequence feature residual blocks respectively, and the third and fourth time sequence feature encoding layers are composed of 9 and 15 time sequence feature random discarding residual blocks respectively; these values are the best parameters determined through experiments.
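Under the same assumptions, the four encoding layers with 4, 6, 9 and 15 blocks could be assembled as below; the per-stage channel widths are placeholders, and the two block classes are sketched after the following paragraphs.

```python
import torch.nn as nn

def make_encoder():
    """Four time sequence feature encoding layers with 4 / 6 / 9 / 15 blocks."""
    widths = [64, 128, 256, 512]          # hypothetical per-stage channel widths
    counts = [4, 6, 9, 15]
    stages, in_ch = [], 3
    for i, (ch, n) in enumerate(zip(widths, counts)):
        Block = TemporalResBlock if i < 2 else TemporalDropResBlock   # defined below
        blocks = [Block(in_ch, ch, downsample=True)]                  # first block halves H and W
        blocks += [Block(ch, ch) for _ in range(n - 1)]
        stages.append(nn.Sequential(*blocks))
        in_ch = ch
    return nn.Sequential(*stages)
```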
Specifically, the position learning module operates as follows: after the feature map is input to the position learning module, it is split into three branches, each of which undergoes a feature map reshaping operation in which the feature map of size C×H×W has its last two dimensions merged, becoming C×(H×W). The first branch then swaps its first and second dimensions, giving an (H×W)×C matrix. This matrix is matrix-multiplied with the matrices on the other two branches: the two operations first produce an (H×W)×(H×W) matrix and then a C×(H×W) matrix, which is reshaped into a C×H×W tensor. Finally, a 1×1 convolution produces a 1×H×W map, which is added to the corresponding positions of the feature map originally input to the position learning module to obtain the final output result.
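A sketch of the position learning module as read from this paragraph, with a batch dimension added; the text mentions no normalization of the (H×W)×(H×W) matrix, so none is applied here.

```python
import torch
import torch.nn as nn

class PositionLearning(nn.Module):
    """Position learning module: branch reshaping, two matrix multiplications, 1x1 conv, residual add."""

    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, 1, kernel_size=1)   # final 1x1 convolution -> 1 x H x W

    def forward(self, x):                                   # x: B x C x H x W
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w)                          # each branch reshaped to C x (H*W)
        attn = torch.bmm(flat.transpose(1, 2), flat)        # (H*W) x C @ C x (H*W) -> (H*W) x (H*W)
        out = torch.bmm(flat, attn)                         # C x (H*W) @ (H*W) x (H*W) -> C x (H*W)
        out = out.view(b, c, h, w)                          # reshape back to C x H x W
        pos = self.proj(out)                                # 1 x H x W position map
        return x + pos                                      # add to the input feature map (broadcast over channels)
```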
Specifically, the time sequence feature residual block and the time sequence feature random discarding residual block are constituted as follows. The time sequence feature residual block consists of a 5×5 convolution layer, a layer normalization layer, a 3×3 depthwise convolution layer, an activation layer and a 1×1 convolution layer; the input feature map passes through these layers in sequence and is then added to the input feature map through the residual branch to produce the output. The time sequence feature random discarding residual block consists of a 7×7 convolution layer, a layer normalization layer, an activation layer, a 1×1 convolution layer and a random discarding layer; the input feature map passes through the first four layers in sequence, is added to the input feature map through the residual branch, and is then output after passing through the random discarding layer. The activation layer uses the ReLU activation function, and the random discarding layer uses the DropPath operation.
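The two residual blocks, as described, might be sketched as follows. Layer normalization is implemented with `GroupNorm(1, C)` (a convolution-friendly equivalent), the DropPath helper is written out so the sketch has no version-specific dependency, and the `downsample` option anticipates the stride-2 arrangement described in the next paragraph.

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Randomly drops whole samples during training (stochastic depth)."""
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p
    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        keep = 1.0 - self.p
        mask = (torch.rand(x.shape[0], 1, 1, 1, device=x.device) < keep).float()
        return x * mask / keep

class TemporalResBlock(nn.Module):
    """5x5 conv -> layer norm -> 3x3 depthwise conv -> ReLU -> 1x1 conv, plus residual add."""
    def __init__(self, in_ch, out_ch, downsample=False):
        super().__init__()
        stride = 2 if downsample else 1
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 5, stride=stride, padding=2),
            nn.GroupNorm(1, out_ch),                                   # layer normalization
            nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch),    # depthwise convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),
        )
        # 2x2 stride-2 conv on the residual branch keeps sizes consistent when downsampling.
        self.shortcut = nn.Conv2d(in_ch, out_ch, 2, stride=2) if downsample else nn.Identity()
    def forward(self, x):
        return self.body(x) + self.shortcut(x)

class TemporalDropResBlock(nn.Module):
    """7x7 conv -> layer norm -> ReLU -> 1x1 conv, residual add, then DropPath (as described)."""
    def __init__(self, in_ch, out_ch, downsample=False, drop_p=0.1):
        super().__init__()
        stride = 2 if downsample else 1
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 7, stride=stride, padding=3),
            nn.GroupNorm(1, out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, 2, stride=2) if downsample else nn.Identity()
        self.drop = DropPath(drop_p)
    def forward(self, x):
        return self.drop(self.body(x) + self.shortcut(x))
```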
Specifically, the two kinds of residual blocks are arranged in the time sequence feature encoder as follows. The first 5×5 convolution layer of the first time sequence feature residual block in each of the first two time sequence feature encoding layers sets its stride to 2 to reduce the height and width of the feature map; in this case a 2×2 convolution layer is used on the residual branch to reduce the height and width of the feature map so that the sizes remain consistent when the feature maps are added, and the other time sequence feature residual blocks do not do this. The reason is that setting the stride of these layers to 2 reduces the size of the feature map, while the other blocks only need to learn features and do not need to reduce the feature map size. Likewise, the first 7×7 convolution layer of the first time sequence feature random discarding residual block in each of the last two time sequence feature encoding layers sets its stride to 2 to reduce the height and width of the feature map; a 2×2 convolution layer is used on the residual branch to keep the sizes consistent when the feature maps are added, and the other time sequence feature random discarding residual blocks do not do this.
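A quick shape check of the stage-opening stride-2 configuration just described (the sizes are arbitrary; even spatial dimensions are assumed so that the 2×2 stride-2 shortcut matches the main branch):

```python
import torch

block = TemporalResBlock(64, 128, downsample=True)   # first block of a stage
x = torch.randn(1, 64, 128, 256)
assert block(x).shape == (1, 128, 64, 128)           # height and width halved, channels changed
```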
(2) Designing a loss function and training on an urban landscape data set to obtain a video semantic segmentation model. The urban landscape data set has 19 classes; the label of each picture is a single-channel image whose pixel values lie in the range 0 to 18, with each class corresponding to one pixel value, so that classification is realized at the pixel level; such a label image is commonly called a mask image.
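For illustration, a mask image of this form can be loaded and converted to a one-hot tensor as below (the file format and helper name are assumptions); the one-hot form is what the loss sketch further below consumes.

```python
import numpy as np
import torch
from PIL import Image

NUM_CLASSES = 19   # pixel values 0..18, one per class

def load_mask(path):
    """Load a single-channel mask image and return a one-hot tensor of shape C x H x W."""
    mask = torch.from_numpy(np.array(Image.open(path))).long()              # H x W, values 0..18
    return torch.nn.functional.one_hot(mask, NUM_CLASSES).permute(2, 0, 1).float()
```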
specifically, the loss function designed in the step 2 is a position weighted loss number L p Loss L from two parts 1 And L 2 Composition, L 1 And L 2 The specific formula is as follows:
L_1 = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} \alpha_j \, w_i \, y_{ij} \log\left(p_{ij}\right)

L_2 = \frac{1}{C}\sum_{j=1}^{C} \frac{2\sum_{i=1}^{N} y_{ij}\, p_{ij} + \varepsilon}{\sum_{i=1}^{N} y_{ij} + \sum_{i=1}^{N} p_{ij} + \varepsilon}
In the formulas for L_1 and L_2, C is the number of pixel classes, N is the number of pixels in the mask, y_ij denotes the true label of the i-th pixel for the j-th class, and p_ij denotes the predicted probability of the j-th class for the i-th pixel. α_j assigns different weights to different classes j: objects that are easy to segment, such as the background and persons, are given lower weights, and the other classes are given larger weights according to the experimental effect, the ratio of the two weights being 9:10. w_i is a position weight, with pixels at different positions assigned different weights; the position weight in the middle of the image is larger than that at the image edge, the two being 1.1 and 1 respectively. ε is a small value, usually set to 0.0004, used to avoid a zero denominator. L_1 and L_2 form the position-weighted loss function L_p as follows:
L_p = L_1 + \lambda \left| 1 - L_2 \right|
where λ is a loss weight used to control the contribution of the latter part and is typically set to 0.8; |1 - L_2| denotes taking the absolute value of 1 - L_2. This part of the loss drives the network to keep 1 - L_2 as small as possible, so that training focuses more on the accuracy of boundary pixel segmentation. Combining the two parts in the position-weighted loss function L_p lets the network training attend to the overall segmentation result while also attending to the segmentation of edge information.
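Because the formula images are not reproduced in this text, the sketch below implements one consistent reading of the description: L_1 as a class- and position-weighted cross-entropy, L_2 as a Dice-style overlap score in [0, 1], and L_p = L_1 + λ|1 - L_2| with λ = 0.8 and ε = 0.0004. The exact forms, and the way the "middle" of the image is delimited for the position weights, are assumptions.

```python
import torch

def position_weights(h, w, center=1.1, edge=1.0):
    """Per-pixel weights w_i: larger in the middle of the image than at the edge (assumed split)."""
    wmap = torch.full((h, w), edge)
    wmap[h // 4: 3 * h // 4, w // 4: 3 * w // 4] = center   # hypothetical notion of "middle"
    return wmap

def position_weighted_loss(pred, target, class_w, lam=0.8, eps=4e-4):
    """pred: B x C x H x W class probabilities; target: B x C x H x W one-hot masks."""
    b, c, h, w = pred.shape
    w_i = position_weights(h, w).to(pred.device)                 # H x W position weights

    # L1: class- and position-weighted cross-entropy (assumed form); eps also guards the log.
    ce = -(class_w.view(1, c, 1, 1) * w_i * target * torch.log(pred + eps)).sum(dim=1)
    l1 = ce.mean()

    # L2: Dice-style overlap score (assumed form); eps avoids a zero denominator.
    inter = (pred * target).sum(dim=(0, 2, 3))
    union = pred.sum(dim=(0, 2, 3)) + target.sum(dim=(0, 2, 3))
    l2 = ((2 * inter + eps) / (union + eps)).mean()

    return l1 + lam * (1 - l2).abs()
```

Here `class_w` would be a length-19 tensor, for example 0.9 for the easy classes (background, person) and 1.0 for the others, matching the 9:10 ratio given above.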
(3) The video semantic segmentation model is used to realize intelligent segmentation of video.
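A minimal inference sketch under the same assumptions, reusing the reference-frame selection from the earlier sketch; `model` is assumed to wrap the encoders, the fusion path and the decoder and to return a per-pixel class-index mask.

```python
import torch

@torch.no_grad()
def segment_video(model, frames):
    """Run the trained model over a clip, producing one prediction mask per frame."""
    model.eval()
    masks = []
    for t in range(len(frames)):
        f_t, f_t1, f_t2 = select_reference_frames(frames, t)   # fallback rule from the earlier sketch
        masks.append(model(f_t, f_t1, f_t2))                   # H x W tensor of class indices
    return masks
```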
The invention provides a video semantic segmentation method combining dynamic information and static information. By improving the network structure and designing the loss function, it achieves efficient segmentation of video, addresses the problem that video segmentation otherwise has to be done manually, and provides a high-accuracy video segmentation network construction strategy. Compared with existing advanced video semantic segmentation methods, it improves the mean intersection-over-union on the urban landscape data set by 0.8%.
Various modifications and alterations to this application may be made by those skilled in the art without departing from the spirit and scope of this application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (8)

1. A video semantic segmentation method combining dynamic information and static information, characterized by comprising the following steps:
step 1, constructing a video semantic segmentation network architecture combining dynamic information and static information;
the video semantic segmentation network architecture is provided with 3 reference frames, which respectively process the video frame at the current moment T, the video frame at moment T-1 and the video frame at moment T-2; each reference frame uses a time sequence feature encoder to extract features and outputs the feature map of the corresponding reference frame through a convolution layer; the output feature map of the second reference frame is spliced with the output feature map of the third reference frame and the result is sent to a position learning module to learn position information, obtaining a dynamic information feature map; the dynamic information feature map is added to the static information feature map obtained after the output feature map of the first reference frame is learned by the position learning module, obtaining a feature representation with dynamic information and static information; this feature representation is sent to the position learning module for learning and then to a decoder for feature decoding; finally, the index of the maximum value of the category prediction of each pixel is taken to obtain the final prediction mask;
step 2, designing a loss function, and training on a data set to obtain a video semantic segmentation model;
and 3, using a video semantic segmentation model to realize intelligent segmentation of the video.
2. The video semantic segmentation method combining dynamic information and static information according to claim 1, wherein: the time sequence feature encoder is divided into four time sequence feature encoding layers, of which the first two layers are composed of time sequence feature residual blocks and the last two layers are composed of time sequence feature random discarding residual blocks;
the first and second time sequence feature encoding layers are composed of K1 and K2 time sequence feature residual blocks respectively, and the third and fourth time sequence feature encoding layers are composed of K3 and K4 time sequence feature random discarding residual blocks respectively;
the time sequence feature residual block consists of a convolution layer, a layer normalization layer, a depthwise convolution layer, an activation layer and a convolution layer; the feature map input to the block passes through these layers in sequence and is then added to the input feature map through the residual branch to output the feature map; the time sequence feature random discarding residual block consists of a convolution layer, a layer normalization layer, an activation layer, a convolution layer and a random discarding layer; the input feature map passes through the first four layers in sequence, is added to the input feature map through the residual branch, and is then output after passing through the random discarding layer.
3. The video semantic segmentation method combining dynamic information and static information according to claim 2, wherein: the activation layer uses the ReLU activation function, and the random discarding layer uses the DropPath operation.
4. The video semantic segmentation method combining dynamic information and static information according to claim 2, wherein: the first 5×5 convolution layer of the first time sequence feature residual block in each of the first two time sequence feature encoding layers of the time sequence feature encoder sets its stride to 2 to reduce the height and width of the feature map; in this case a 2×2 convolution layer is used on the residual branch of the time sequence feature residual block to reduce the height and width of the feature map so that the sizes remain consistent when the feature maps are added, and the other time sequence feature residual blocks do not perform this operation; the first 7×7 convolution layer of the first time sequence feature random discarding residual block in each of the last two time sequence feature encoding layers sets its stride to 2 to reduce the height and width of the feature map; a 2×2 convolution layer is used on the residual branch to keep the sizes consistent when the feature maps are added, and the other time sequence feature random discarding residual blocks do not perform this operation.
5. The video semantic segmentation method combining dynamic information and static information according to claim 1, wherein the specific processing procedure of the position learning module is as follows:
after the feature map is input to the position learning module, it is split into three branches, each of which undergoes a feature map reshaping operation in which the feature map of size C×H×W has its last two dimensions merged, becoming C×(H×W); the first branch then swaps its first and second dimensions, giving an (H×W)×C matrix; this matrix is matrix-multiplied with the matrices on the other two branches, the two operations first producing an (H×W)×(H×W) matrix and then a C×(H×W) matrix, which is reshaped into a C×H×W tensor; finally, a 1×1 convolution produces a 1×H×W map, which is added to the corresponding positions of the feature map originally input to the position learning module to obtain the final output result.
6. The video semantic segmentation method combining dynamic information and static information according to claim 1, wherein: the feature map output by the time sequence feature encoder of the first reference frame passes through a 5×5 convolution to extract features, and the feature map of the first reference frame is output; the feature map output by the time sequence feature encoder of the second reference frame passes through a 7×7 convolution to extract features, and the feature map of the second reference frame is output; the feature map output by the time sequence feature encoder of the third reference frame passes through an 11×11 convolution to extract features, and the feature map of the third reference frame is output.
7. The video semantic segmentation method combining dynamic information and static information according to claim 1, wherein: the loss function designed in step 2 is a position-weighted loss function L_p composed of two parts, a loss L_1 and a loss L_2, whose specific formulas are as follows:
L_1 = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} \alpha_j \, w_i \, y_{ij} \log\left(p_{ij}\right)

L_2 = \frac{1}{C}\sum_{j=1}^{C} \frac{2\sum_{i=1}^{N} y_{ij}\, p_{ij} + \varepsilon}{\sum_{i=1}^{N} y_{ij} + \sum_{i=1}^{N} p_{ij} + \varepsilon}
In the formulas for L_1 and L_2, C is the number of pixel classes, N is the number of pixels in the mask, y_ij denotes the true label of the i-th pixel for the j-th class, p_ij denotes the predicted probability of the j-th class for the i-th pixel, α_j assigns different weights to different classes j, w_i is a position weight that assigns different weights to pixels at different positions, and ε is a small value used to avoid a zero denominator; L_1 and L_2 form the position-weighted loss function L_p as follows:
L_p = L_1 + \lambda \left| 1 - L_2 \right|
where λ is a loss weight used to control the contribution of the latter part, and |1 - L_2| denotes taking the absolute value of 1 - L_2.
8. The video semantic segmentation method combining dynamic information and static information according to claim 7, wherein: the value of α_j is determined by the object to be segmented, and objects that are easy to segment are assigned smaller weights than other objects; the value of w_i is determined by the position of the pixel in the image, and pixels in the middle of the image receive a larger position weight than pixels at the image edge.
CN202310536770.7A 2023-05-12 2023-05-12 Video semantic segmentation method combining dynamic information and static information Active CN116246075B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310536770.7A CN116246075B (en) 2023-05-12 2023-05-12 Video semantic segmentation method combining dynamic information and static information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310536770.7A CN116246075B (en) 2023-05-12 2023-05-12 Video semantic segmentation method combining dynamic information and static information

Publications (2)

Publication Number Publication Date
CN116246075A true CN116246075A (en) 2023-06-09
CN116246075B CN116246075B (en) 2023-07-21

Family

ID=86633542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310536770.7A Active CN116246075B (en) 2023-05-12 2023-05-12 Video semantic segmentation method combining dynamic information and static information

Country Status (1)

Country Link
CN (1) CN116246075B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200092552A1 (en) * 2018-09-18 2020-03-19 Google Llc Receptive-Field-Conforming Convolutional Models for Video Coding
CN111050219A (en) * 2018-10-12 2020-04-21 奥多比公司 Spatio-temporal memory network for locating target objects in video content
CN111062395A (en) * 2019-11-27 2020-04-24 北京理工大学 Real-time video semantic segmentation method
CN111652899A (en) * 2020-05-29 2020-09-11 中国矿业大学 Video target segmentation method of space-time component diagram
CN113570610A (en) * 2021-07-26 2021-10-29 北京百度网讯科技有限公司 Method and device for performing target segmentation on video by adopting semantic segmentation model
CN114596520A (en) * 2022-02-09 2022-06-07 天津大学 First visual angle video action identification method and device
CN114663460A (en) * 2022-02-28 2022-06-24 华南农业大学 Video segmentation method and device based on double-current driving encoder and feature memory module
CN114973071A (en) * 2022-05-11 2022-08-30 中国科学院软件研究所 Unsupervised video target segmentation method and system based on long-term and short-term time sequence characteristics
US20230035475A1 (en) * 2021-07-16 2023-02-02 Huawei Technologies Co., Ltd. Methods and systems for semantic segmentation of a point cloud

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200092552A1 (en) * 2018-09-18 2020-03-19 Google Llc Receptive-Field-Conforming Convolutional Models for Video Coding
CN111050219A (en) * 2018-10-12 2020-04-21 奥多比公司 Spatio-temporal memory network for locating target objects in video content
CN111062395A (en) * 2019-11-27 2020-04-24 北京理工大学 Real-time video semantic segmentation method
CN111652899A (en) * 2020-05-29 2020-09-11 中国矿业大学 Video target segmentation method of space-time component diagram
US20230035475A1 (en) * 2021-07-16 2023-02-02 Huawei Technologies Co., Ltd. Methods and systems for semantic segmentation of a point cloud
CN113570610A (en) * 2021-07-26 2021-10-29 北京百度网讯科技有限公司 Method and device for performing target segmentation on video by adopting semantic segmentation model
CN114596520A (en) * 2022-02-09 2022-06-07 天津大学 First visual angle video action identification method and device
CN114663460A (en) * 2022-02-28 2022-06-24 华南农业大学 Video segmentation method and device based on double-current driving encoder and feature memory module
CN114973071A (en) * 2022-05-11 2022-08-30 中国科学院软件研究所 Unsupervised video target segmentation method and system based on long-term and short-term time sequence characteristics

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
SIYUE YU: "Fast pixel-matching for video object segmentation", Signal Processing: Image Communication, vol. 98, pages 3-5 *
余锋: "Pedestrian trajectory prediction based on a multi-head soft attention graph convolutional network", 《计算机应用》 (Journal of Computer Applications), vol. 43, no. 03, pages 736-743 *
余锋: "Research on a virtual try-on algorithm for multi-pose transfer", 《武汉纺织大学学报》 (Journal of Wuhan Textile University), vol. 35, no. 01, pages 3-9 *
景庄伟; 管海燕; 彭代峰; 于永涛: "A survey of image semantic segmentation based on deep neural networks", 《计算机工程》 (Computer Engineering), vol. 46, no. 10, pages 1-17 *
王婷婷: "Research on intelligent pedestrian detection methods for road traffic scenes", 《中国优秀硕士学位论文全文数据库 (工程科技Ⅱ辑)》 (China Master's Theses Full-text Database, Engineering Science and Technology II), no. 02, pages 034-1395 *

Also Published As

Publication number Publication date
CN116246075B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN106960206B (en) Character recognition method and character recognition system
CN111368636B (en) Object classification method, device, computer equipment and storage medium
CN112801027B (en) Vehicle target detection method based on event camera
CN112634296A (en) RGB-D image semantic segmentation method and terminal for guiding edge information distillation through door mechanism
CN111738169A (en) Handwriting formula recognition method based on end-to-end network model
CN112084859A (en) Building segmentation method based on dense boundary block and attention mechanism
CN117197763A (en) Road crack detection method and system based on cross attention guide feature alignment network
CN113901924A (en) Document table detection method and device
CN113436198A (en) Remote sensing image semantic segmentation method for collaborative image super-resolution reconstruction
CN117975418A (en) Traffic sign detection method based on improved RT-DETR
CN111429468B (en) Cell nucleus segmentation method, device, equipment and storage medium
CN116246075B (en) Video semantic segmentation method combining dynamic information and static information
CN118230323A (en) Semantic segmentation method for fusing space detail context and multi-scale interactive image
CN116597339A (en) Video target segmentation method based on mask guide semi-dense contrast learning
CN114782995A (en) Human interaction behavior detection method based on self-attention mechanism
CN115205518A (en) Target detection method and system based on YOLO v5s network structure
Wang et al. Research on Semantic Segmentation Algorithm for Multiscale Feature Images Based on Improved DeepLab v3+
CN114758279B (en) Video target detection method based on time domain information transfer
CN118155219A (en) Text detection and recognition method for terminal strip coding tube based on deep learning
CN118965058A (en) Language expression-based arbitrary category counting model and counting method thereof
CN118968051A (en) Urban scene real-time semantic segmentation method based on gating alignment network
CN118918326A (en) Urban scene real-time semantic segmentation method based on bilateral fusion network
CN115661676A (en) Building segmentation system and method based on serial attention module and parallel attention module
CN118038042A (en) Infrared image segmentation method and device based on visible light image enhancement

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant