CN118397516A - Method and device for constructing video water body segmentation model based on mask self-encoder
- Publication number
- CN118397516A (Application CN202410823092.7A)
- Authority
- CN
- China
- Prior art keywords
- frame
- water body
- video
- segmentation
- body segmentation
- Prior art date
- Legal status: Granted
Classifications
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/09—Supervised learning
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/467—Encoded features or binary features, e.g. local binary patterns [LBP]
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The application provides a method and a device for constructing a video water body segmentation model based on a mask self-encoder. The method comprises the following steps: acquiring at least one video frame sequence containing a water body as a training sample and inputting it into a video water body segmentation architecture; encoding each frame in the video frame sequence with a scene coding module in the architecture to obtain a multi-scale image corresponding to each frame; obtaining a rough water body segmentation map of the current frame with a space-time feature aggregation module, and refining the rough map with a mask self-encoding-decoding module to obtain a fine water body segmentation map; and predicting a fine water body segmentation map of the initial frame from the accumulated fine water body segmentation maps and calculating the loss, thereby obtaining the video water body segmentation model. The scheme designs a mask self-encoder and trains the video water body segmentation model in a single-frame supervision mode, which reduces the video labeling cost and improves the video water body segmentation precision.
Description
Technical Field
The application relates to the technical field of image processing, in particular to a method and a device for constructing a video water body segmentation model based on a mask self-encoder.
Background
Video water body segmentation is an important task in the field of computer vision. It aims to automatically identify and separate water areas from a video sequence, and has wide application value in environment monitoring, natural disaster early warning, water resource management and the like. With the popularization of remote sensing technology and unmanned aerial vehicle monitoring, acquiring continuous video data of water areas has become easier, but how to process these data efficiently and accurately and extract valuable information remains an urgent problem.
However, the complexity of the natural environment causes large appearance changes of the water body under different illumination and weather conditions, which increases the segmentation difficulty. In addition, traditional video water body segmentation methods rely on deep learning and similar technologies; although these technologies perform strongly, they usually require a complex model structure combined with a large amount of annotated data for training, which limits their application in actual scenes.
In view of the foregoing, there is a need for a method that can achieve high-precision water body segmentation with a simple model structure in the absence of a large amount of labeled data, so that it can be better applied in practical scenarios.
Disclosure of Invention
The embodiment of the application provides a method and a device for constructing a video water body segmentation model based on a mask self-encoder. A mask self-encoder is designed, and the video water body segmentation model is trained in a single-frame supervision mode in which only the initial frame is labeled, so that the labeling cost of the video is reduced and the precision of video water body segmentation is improved through self-supervised learning.
In a first aspect, an embodiment of the present application provides a method for constructing a video water body segmentation model based on a mask self-encoder, where the method includes:
Acquiring at least one video frame sequence containing a water body as a training sample, labeling the water body of the initial frame in each video frame sequence, and inputting the video frame sequence into a constructed video water body segmentation architecture;
The video water body segmentation architecture comprises a scene coding module, a space-time feature aggregation module and a mask self-encoding-decoding module, wherein the scene coding module encodes each frame in the video frame sequence to obtain a multi-scale image corresponding to each frame;
The space-time feature aggregation module takes the input current frame as a query, takes the multi-scale image corresponding to the previous frame of the current frame as a key, and takes the predicted segmentation result corresponding to the previous frame of the current frame as a value to perform attention calculation, obtaining a rough water body segmentation map of the current frame; if the previous frame of the current frame is the initial frame, the water body labeling information of the initial frame is used as the predicted segmentation result of the initial frame;
The mask self-encoding-decoding module obtains the water body features of the current frame based on the rough water body segmentation map of the current frame, the multi-scale image corresponding to the current frame, and the current frame itself, and progressively fuses the water body features of the current frame with the multi-scale image corresponding to the current frame to obtain a fine water body segmentation map of the current frame; the fine water body segmentation map of the current frame is then taken as the predicted segmentation result of the next frame to obtain the fine water body segmentation map of the next frame, until a fine water body segmentation map is obtained for every frame except the initial frame;
The space-time feature aggregation module then takes the initial frame as a query, takes a set number of fine water body segmentation maps as keys, and takes the multi-scale images corresponding to those keys as values to perform attention calculation, obtaining a rough water body segmentation map of the initial frame; the mask self-encoding-decoding module refines the rough water body segmentation map of the initial frame to obtain a fine water body segmentation map of the initial frame;
And constructing a loss function by using the fine water body segmentation map of the initial frame and the water body labeling information of the initial frame, and retaining parameters of the current video water body segmentation architecture when the loss function meets set conditions to obtain a video water body segmentation model.
In a second aspect, an embodiment of the present application provides a video water body segmentation method, including:
Acquiring a video frame sequence to be segmented, and inputting the video frame sequence to be segmented into a trained video water body segmentation model to obtain a water body segmentation result.
In a third aspect, an embodiment of the present application provides a device for constructing a video water body segmentation model based on a mask self-encoder, including:
the acquisition module is used for acquiring at least one video frame sequence containing a water body as a training sample, labeling the water body of the initial frame in each video frame sequence, and inputting the video frame sequence into the constructed video water body segmentation architecture;
the video water body segmentation architecture comprises a scene coding module, a space-time feature aggregation module and a mask self-encoding-decoding module, wherein the scene coding module encodes each frame in the video frame sequence to obtain a multi-scale image corresponding to each frame;
The space-time feature aggregation module takes the input current frame as a query, takes the multi-scale image corresponding to the previous frame of the current frame as a key, and takes the predicted segmentation result corresponding to the previous frame of the current frame as a value to perform attention calculation, obtaining a rough water body segmentation map of the current frame; if the previous frame of the current frame is the initial frame, the water body labeling information of the initial frame is used as the predicted segmentation result of the initial frame;
The mask self-encoding-decoding module obtains the water body features of the current frame based on the rough water body segmentation map of the current frame, the multi-scale image corresponding to the current frame, and the current frame itself, and progressively fuses the water body features of the current frame with the multi-scale image corresponding to the current frame to obtain a fine water body segmentation map of the current frame; the fine water body segmentation map of the current frame is then taken as the predicted segmentation result of the next frame to obtain the fine water body segmentation map of the next frame, until a fine water body segmentation map is obtained for every frame except the initial frame;
the initial frame prediction module is used for taking the initial frame as a query in the space-time feature aggregation module, taking a set number of fine water body segmentation maps as keys and the multi-scale images corresponding to those keys as values, performing attention calculation to obtain a rough water body segmentation map of the initial frame, and refining the rough water body segmentation map of the initial frame with the mask self-encoding-decoding module to obtain a fine water body segmentation map of the initial frame;
The loss calculation module is used for constructing a loss function by using the fine water body segmentation map of the initial frame and the water body labeling information of the initial frame, and when the loss function meets the set condition, the parameters of the current video water body segmentation architecture are reserved to obtain a video water body segmentation model.
In a fourth aspect, embodiments of the present application provide an electronic device comprising a memory and a processor, the memory having stored therein a computer program, the processor being arranged to run the computer program to perform a method of constructing a mask-based self-encoder video water segmentation model or a method of video water segmentation.
In a fifth aspect, embodiments of the present application provide a readable storage medium having stored therein a computer program comprising program code for controlling a process to perform a process comprising a method of constructing a video water segmentation model based on a mask-self encoder or a method of video water segmentation.
The main contributions and innovation points of the invention are as follows:
According to the embodiment of the application, a video frame sequence is input into the video water body segmentation architecture as a training sample, and the architecture predicts the segmentation result of the current frame from the predicted segmentation result of the previous frame through the space-time feature aggregation module; therefore, the scheme only needs to label the initial frame of the video frame sequence, which greatly reduces the labor cost of labeling. The scheme designs a mask self-encoding-decoding module to refine the rough water body segmentation map, which ensures the accuracy of the water body segmentation prediction of the current frame; the fine water body segmentation map of the current frame is then taken as the predicted segmentation result of the next frame to obtain the fine water body segmentation map of the next frame, so that a high-accuracy water body segmentation result can be obtained from a small sample with only a few labels in a self-supervised manner.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of a method of constructing a mask-based self-encoder video water segmentation model in accordance with an embodiment of the present application;
FIG. 2 is a block diagram of a scene coding module according to an embodiment of the application;
FIG. 3 is a block diagram of a mask self-encoding-decoding module according to an embodiment of the present application;
FIG. 4 is an overall flow diagram of a video water segmentation model according to an embodiment of the present application;
FIG. 5 is a block diagram of a construction apparatus for a mask-based self-encoder video water segmentation model according to an embodiment of the present application;
fig. 6 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.
Example 1
The embodiment of the application provides a method for constructing a video water body segmentation model based on a mask self-encoder. A mask self-encoder is designed, and the video water body segmentation model is trained in a single-frame supervision mode, so that the video annotation cost is reduced and the video water body segmentation precision is improved. Specifically, referring to fig. 1, the method comprises the following steps:
Acquiring at least one video frame sequence containing water body as a training sample, marking the water body of an initial frame in each video frame sequence, and inputting the initial frame into a constructed video water body segmentation architecture;
The video water body segmentation architecture comprises a scene coding module, a space-time characteristic aggregation module and a mask self-coding-decoding module, wherein the scene coding module codes each frame in a video frame sequence to respectively obtain a multi-scale image corresponding to each frame;
The time-space feature aggregation module takes an input current frame as a query, takes a multi-scale image corresponding to a previous frame of the current frame as a key, and takes a prediction segmentation result corresponding to the previous frame of the current frame as a value to perform attention calculation to obtain a rough water segmentation map of the current frame, wherein if the previous frame of the current frame is an initial frame, water labeling information of the initial frame is taken as a prediction segmentation result of the initial frame;
Obtaining the water body characteristics of the current frame based on the rough water body segmentation map of the current frame, the multi-scale image corresponding to the current frame and the current frame in the mask self-encoding-decoding module, gradually fusing the water body characteristics of the current frame with the multi-scale image corresponding to the current frame to obtain a fine water body segmentation map of the current frame, and taking the fine water body segmentation map of the current frame as a prediction segmentation result of the next frame to obtain a fine water body segmentation map of the next frame until the fine water body segmentation map of each frame image except the initial frame is obtained;
The space-time feature aggregation module inquires an initial frame, respectively takes a set number of fine water body segmentation graphs as keys, performs attention calculation by taking a multi-scale image corresponding to the keys as a value to obtain a rough water body segmentation graph of the initial frame, and refines the rough water body segmentation graph of the initial frame by the mask self-coding-decoding module to obtain the fine water body segmentation graph of the initial frame;
And constructing a loss function by using the fine water body segmentation map of the initial frame and the water body labeling information of the initial frame, and retaining parameters of the current video water body segmentation architecture when the loss function meets set conditions to obtain a video water body segmentation model.
In this scheme, the video frame sequences in the training sample are preprocessed with random flipping, photometric distortion, random affine transformation, and random cropping and scaling. Random flipping flips each batch of training video frame sequences in the horizontal direction with a probability of 50%; photometric distortion adjusts the brightness, chromaticity, contrast and saturation of the video frame sequence; random affine transformation applies random rotation, translation, shearing and scaling while keeping the center of the video frame sequence unchanged; and random cropping and scaling crops a region of random size from the video frame sequence and scales it to 384×384 to meet the input requirement of the video water body segmentation architecture. A sketch of such a pipeline is given below.
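The following is a minimal sketch of this preprocessing using torchvision; the framework choice and the exact parameter ranges (rotation angle, jitter strengths, crop scale) are assumptions for illustration, since the embodiment fixes only the operations, the 50% flip probability, and the 384×384 output size.

```python
import torchvision.transforms as T

def build_train_transform():
    # Hypothetical parameter ranges; only p=0.5 and the 384x384 size are
    # taken from the embodiment.
    return T.Compose([
        T.RandomHorizontalFlip(p=0.5),                     # random flip
        T.ColorJitter(brightness=0.4, contrast=0.4,        # photometric distortion
                      saturation=0.4, hue=0.1),
        T.RandomAffine(degrees=15, translate=(0.1, 0.1),   # rotation / translation /
                       shear=10, scale=(0.8, 1.2)),        # shear / scale about center
        T.RandomResizedCrop(384, scale=(0.5, 1.0)),        # random-area crop -> 384x384
    ])
```

In practice the same sampled parameters must be shared across all frames of one sequence (and the initial-frame label), so an implementation would typically use the functional counterparts of these transforms rather than applying them per frame independently.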
In this embodiment, the number of frames in each video frame sequence in the training sample is a set number, and in some embodiments, the set number is 8.
In the step of encoding each frame in the video frame sequence by the scene coding module to obtain a multi-scale image corresponding to each frame, the scene encoder consists of a convolution layer followed by a plurality of residual structures, and the multi-scale image comprises the feature maps of different sizes output by each residual structure.
In some embodiments, the structure of the scene encoder is shown in fig. 2, where the scene encoder includes a first residual block, a second residual block, and a third residual block, and a convolution layer in the scene encoder processes an input current frame and sequentially inputs the processed current frame into the first residual block, the second residual block, and the third residual block, where a feature map output by the first residual block has a size of 96×96×256, a feature map output by the second residual block has a size of 48×48×512, and a feature map output by the third residual block has a size of 24×24×1024.
That is, the multi-scale features in this approach include 3 feature maps of different sizes.
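As a concrete illustration, the three stated output shapes (96×96×256, 48×48×512 and 24×24×1024 for a 384×384 input) match the first three stages of a standard ResNet-50, so the scene encoder can be sketched as below; treating it as a truncated ResNet-50 is an assumption, not something the embodiment specifies.

```python
import torch
import torchvision

class SceneEncoder(torch.nn.Module):
    """Convolution stem plus three residual stages, returning the multi-scale
    feature maps. The ResNet-50 backbone is an assumption that matches the
    sizes given in the embodiment."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50()
        self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stage1, self.stage2, self.stage3 = r.layer1, r.layer2, r.layer3

    def forward(self, x):                 # x: (B, 3, 384, 384)
        x = self.stem(x)                  # (B, 64, 96, 96)
        f1 = self.stage1(x)               # (B, 256, 96, 96)
        f2 = self.stage2(f1)              # (B, 512, 48, 48)
        f3 = self.stage3(f2)              # (B, 1024, 24, 24)
        return f1, f2, f3
```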
In this scheme, the space-time feature aggregation module includes a memory bank for storing the multi-scale feature map corresponding to each frame output by the scene coding module, so as to facilitate the subsequent calculation of the multi-scale features.
In the step of "taking the multi-scale image corresponding to the previous frame of the current frame as a key", the feature map with the smallest resolution in the multi-scale image corresponding to the previous frame of the current frame is taken as a key.
Specifically, the multi-scale image contains several feature maps of different sizes. The smaller the resolution of a feature map, the larger its number of channels; that is, the lower-resolution map carries more semantic information and is therefore better suited as a key for water body segmentation. Taking the feature map with the minimum resolution in the multi-scale image as the key improves the accuracy of water body segmentation.
Further, the prediction segmentation result is downsampled to be the same size as a feature map with minimum resolution in the multi-scale image and then used as a value.
Specifically, in the step of performing attention calculation with a prediction segmentation result corresponding to the previous frame of the current frame as a value to obtain a rough water segmentation map of the current frame, the result of the attention calculation is up-sampled to obtain a rough water segmentation map, and the size of the rough water segmentation map is the same as the size of an image in a video frame sequence.
Specifically, since the key and value adopted in this scheme are feature maps of size 24×24 when the attention is computed, the attention calculation result is also 24×24. Directly using this result cannot restore the water body features in the video frame well, so to better obtain the water body features, the attention result is restored to the image size of the video frame sequence by up-sampling.
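A minimal sketch of this aggregation step follows. It assumes single-head dot-product attention over flattened 24×24 grids; the head count and the absence of learned projections are assumptions, since the embodiment only fixes the query/key/value roles and the up-sampling back to frame size.

```python
import torch
import torch.nn.functional as F

def aggregate(query_feat, key_feat, prev_mask, frame_size=384):
    """query_feat, key_feat: (B, C, 24, 24) smallest-resolution features of the
    current and previous frame; prev_mask: (B, 1, H, W) predicted segmentation
    of the previous frame. Returns a coarse mask at frame resolution."""
    B, C, H, W = query_feat.shape
    q = query_feat.flatten(2).transpose(1, 2)           # (B, HW, C)
    k = key_feat.flatten(2).transpose(1, 2)             # (B, HW, C)
    v = F.interpolate(prev_mask, size=(H, W))           # value: mask downsampled to 24x24
    v = v.flatten(2).transpose(1, 2)                    # (B, HW, 1)
    attn = torch.softmax(q @ k.transpose(1, 2) / C**0.5, dim=-1)  # (B, HW, HW)
    coarse = (attn @ v).transpose(1, 2).reshape(B, 1, H, W)
    return F.interpolate(coarse, size=(frame_size, frame_size),
                         mode='bilinear', align_corners=False)
```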
In this scheme, the structure of the mask self-encoding-decoding module is shown in fig. 3. The mask self-encoding-decoding module includes a mask encoding unit and a mask decoding unit. The mask encoding unit is formed by concatenating an encoding convolution layer, a plurality of encoding residual blocks and a feature encoding fusion module: the current frame and the rough water body segmentation map of the current frame are spliced and then input into the mask encoding unit, and a first feature map is output after passing through the encoding convolution layer and the encoding residual blocks. The size of the first feature map is the same as that of the feature map with the minimum resolution in the multi-scale image, and the number of encoding residual blocks is the same as the number of residual blocks in the scene coding module. The feature encoding fusion module then fuses the first feature map with the minimum-resolution feature map of the multi-scale image using spatial attention and channel attention operations to obtain the first water body feature map; one plausible form of this fusion module is sketched below.
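The embodiment does not spell out the fusion operator beyond "spatial attention and channel attention"; the sketch below uses a CBAM-style formulation as one plausible reading, so the pooling choices, the reduction ratio and the 7×7 spatial kernel are assumptions.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuses two same-sized feature maps with channel then spatial attention
    (CBAM-style; an assumed concrete form of the fusion described above)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, a, b):
        x = a + b                                        # merge the two inputs
        # channel attention from global average pooling
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3)))) # (B, C)
        x = x * w[:, :, None, None]
        # spatial attention from channel-wise mean and max maps
        s = torch.cat([x.mean(1, keepdim=True),
                       x.max(1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```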
The mask decoding unit is formed by concatenating a feature decoding fusion unit, a plurality of up-sampling layers and a classification head. The feature decoding fusion unit has the same structure as the feature encoding fusion module: it again fuses the first water body feature map with the minimum-resolution feature map of the multi-scale image using spatial attention and channel attention operations to obtain a second water body feature map. The up-sampling layers progressively up-sample the second water body feature map, and after each up-sampling the result is fused with the feature map of the same size in the multi-scale image to obtain a third water body feature map. Finally, the classification head processes the third water body feature map with a bilinear-interpolation up-sampling operation to output the fine water body segmentation map of the current frame.
That is, the number of up-sampling layers in the mask decoding unit is 1 less than the number of residual blocks in the scene coding module.
Taking the number of residual blocks in the scene encoder as 3 and the sizes of the feature maps output by the residual blocks as 96×96, 48×48 and 24×24 as an example, the number of residual blocks in the mask encoding unit is also 3, and the size of the first feature map output by the last residual block is 24×24; the feature encoding fusion module then fuses the first feature map with the minimum-resolution feature map of the multi-scale image to obtain the first water body feature map.
The number of up-sampling layers in the mask decoding unit is 2, each with a scaling factor of 2. The second water body feature map output by the feature decoding fusion unit has a size of 24×24; it is up-sampled to 48×48 by the first up-sampling layer and fused with the 48×48 feature map in the multi-scale feature maps, then up-sampled to 96×96 by the second up-sampling layer and fused with the 96×96 feature map to obtain the third water body feature map. The classification head performs classification output through four-times bilinear-interpolation up-sampling to obtain a 384×384 fine water body segmentation map; that is, the fine water body segmentation map output by this scheme has the same size as the image input to the video water body segmentation architecture.
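Putting the pieces together, a sketch of the mask decoding unit could look as follows; it reuses the FeatureFusion module sketched above, and the 1×1 projections that align channel counts before fusion are assumptions the embodiment leaves open.

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskDecoder(nn.Module):
    """Feature decoding fusion -> two x2 up-sampling stages fused with the
    matching multi-scale maps -> classification head with 4x bilinear
    up-sampling to 384x384. Channel widths follow the scene encoder."""
    def __init__(self):
        super().__init__()
        self.fuse24 = FeatureFusion(1024)
        self.up1 = nn.Sequential(nn.Upsample(scale_factor=2, mode='bilinear'),
                                 nn.Conv2d(1024, 512, 1))  # assumed 1x1 projection
        self.fuse48 = FeatureFusion(512)
        self.up2 = nn.Sequential(nn.Upsample(scale_factor=2, mode='bilinear'),
                                 nn.Conv2d(512, 256, 1))
        self.fuse96 = FeatureFusion(256)
        self.head = nn.Conv2d(256, 1, 1)                   # water / non-water logit

    def forward(self, water_feat, f1, f2, f3):
        # water_feat: first water body feature map (B, 1024, 24, 24)
        # f1, f2, f3: multi-scale maps (256,96,96), (512,48,48), (1024,24,24)
        x = self.fuse24(water_feat, f3)        # second water body feature map
        x = self.fuse48(self.up1(x), f2)       # 48x48 fusion
        x = self.fuse96(self.up2(x), f1)       # 96x96 -> third water body feature map
        return F.interpolate(self.head(x), scale_factor=4,
                             mode='bilinear', align_corners=False)  # 384x384
```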
In the scheme, the fine water body segmentation map of the current frame is taken as the prediction segmentation result of the next frame to obtain the fine water body segmentation map of the next frame, and the fine water body segmentation map of each frame of image except the initial frame is also obtained through the method, so that detailed steps are not repeated.
In this scheme, a set number of fine water body segmentation maps are acquired by random sampling, and each fine water body segmentation map is down-sampled to the same size as the minimum-resolution feature map in the multi-scale feature maps. Each fine water body segmentation map and the minimum-resolution feature map of the corresponding multi-scale feature maps form a key-value pair, from which the rough water body segmentation map of the initial frame is obtained; the mask self-encoding-decoding module then refines the rough water body segmentation map of the initial frame to obtain its fine water body segmentation map. The overall flow of the video water body segmentation model is shown in fig. 4.
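A brief sketch of this cycle-closing step under the same assumptions as the aggregation sketch above. Note that the embodiment names the fine maps as keys and the features as values; for a runnable, mask-shaped output this sketch reuses the per-frame convention (features as keys, down-sampled masks as values), so the exact role assignment and the value of the "set number" are assumptions.

```python
import random

def predict_initial_frame(init_feat, fine_masks, frame_feats, num_keys=3):
    """init_feat: (B, C, 24, 24) smallest-resolution features of the initial
    frame. fine_masks / frame_feats: fine maps and matching features of the
    other frames. num_keys (the 'set number') = 3 is an assumption."""
    idx = random.sample(range(len(fine_masks)), num_keys)   # random sampling
    coarse = sum(aggregate(init_feat, frame_feats[i], fine_masks[i]) for i in idx)
    return coarse / num_keys  # averaged coarse map, then refined by the decoder
```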
In this scheme, the cross entropy loss and the dice loss between the fine water body segmentation map of the initial frame and the water body labeling information of the initial frame are calculated. With $y_i$ denoting the water body labeling information of the initial frame (taking the value 1 or 0) and $\hat{y}_i$ denoting the fine water body segmentation map of the initial frame, the standard forms of these two losses are

$$L_{ce} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr],\qquad L_{dice} = 1 - \frac{2\sum_{i} y_i\hat{y}_i}{\sum_{i} y_i + \sum_{i}\hat{y}_i},$$

and the total loss is $L = L_{ce} + L_{dice}$. Before the training of each batch ends, the scheme reduces the loss through back propagation and updates the network parameters, then starts the training of the next batch, until all batches are trained or the loss no longer decreases, at which point the current parameters of the video water body segmentation architecture are saved.
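A compact sketch of this training objective, summing the standard binary cross entropy and dice terms written above; the reduction (mean over pixels) and the smoothing epsilon are assumptions.

```python
import torch
import torch.nn.functional as F

def water_loss(pred, label, eps=1e-6):
    """pred: (B, 1, H, W) fine water body segmentation map of the initial
    frame (probabilities in [0, 1]); label: same-shaped 0/1 float tensor."""
    ce = F.binary_cross_entropy(pred, label)                 # cross entropy term
    inter = (pred * label).sum()
    dice = 1 - 2 * inter / (pred.sum() + label.sum() + eps)  # dice term
    return ce + dice
```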
Example 2
A video water segmentation method, comprising:
Acquiring a video frame sequence to be segmented, and inputting the video frame sequence to be segmented into a trained video water body segmentation model to obtain a water body segmentation result.
Example 3
Based on the same conception, referring to fig. 5, the application further provides a device for constructing a video water body segmentation model based on a mask self-encoder, which comprises:
the acquisition module is used for acquiring at least one video frame sequence containing a water body as a training sample, labeling the water body of the initial frame in each video frame sequence, and inputting the video frame sequence into the constructed video water body segmentation architecture;
the video water body segmentation architecture comprises a scene coding module, a space-time feature aggregation module and a mask self-encoding-decoding module, wherein the scene coding module encodes each frame in the video frame sequence to obtain a multi-scale image corresponding to each frame;
The space-time feature aggregation module takes the input current frame as a query, takes the multi-scale image corresponding to the previous frame of the current frame as a key, and takes the predicted segmentation result corresponding to the previous frame of the current frame as a value to perform attention calculation, obtaining a rough water body segmentation map of the current frame; if the previous frame of the current frame is the initial frame, the water body labeling information of the initial frame is used as the predicted segmentation result of the initial frame;
The mask self-encoding-decoding module obtains the water body features of the current frame based on the rough water body segmentation map of the current frame, the multi-scale image corresponding to the current frame, and the current frame itself, and progressively fuses the water body features of the current frame with the multi-scale image corresponding to the current frame to obtain a fine water body segmentation map of the current frame; the fine water body segmentation map of the current frame is then taken as the predicted segmentation result of the next frame to obtain the fine water body segmentation map of the next frame, until a fine water body segmentation map is obtained for every frame except the initial frame;
the initial frame prediction module is used for taking the initial frame as a query in the space-time feature aggregation module, taking a set number of fine water body segmentation maps as keys and the multi-scale images corresponding to those keys as values, performing attention calculation to obtain a rough water body segmentation map of the initial frame, and refining the rough water body segmentation map of the initial frame with the mask self-encoding-decoding module to obtain a fine water body segmentation map of the initial frame;
The loss calculation module is used for constructing a loss function by using the fine water body segmentation map of the initial frame and the water body labeling information of the initial frame, and when the loss function meets the set condition, the parameters of the current video water body segmentation architecture are reserved to obtain a video water body segmentation model.
Example 4
This embodiment also provides an electronic device, referring to fig. 6, comprising a memory 404 and a processor 402, the memory 404 having stored therein a computer program, the processor 402 being arranged to run the computer program to perform the steps of any of the method embodiments described above.
In particular, the processor 402 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement embodiments of the present application.
The memory 404 may include mass storage for data or instructions. By way of example, and not limitation, the memory 404 may comprise a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, a magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory 404 may include removable or non-removable (or fixed) media, where appropriate, and may be internal or external to the data processing apparatus. In a particular embodiment, the memory 404 is non-volatile memory. In particular embodiments, the memory 404 includes read-only memory (ROM) and random access memory (RAM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these. The RAM may be a static random access memory (SRAM) or a dynamic random access memory (DRAM), where the DRAM may be a fast page mode DRAM (FPM DRAM), an extended data out DRAM (EDO DRAM), a synchronous DRAM (SDRAM), or the like, where appropriate.
Memory 404 may be used to store or cache various data files that need to be processed and/or used for communication, as well as possible computer program instructions for execution by processor 402.
Processor 402 implements the method of constructing a mask-based self-encoder video water segmentation model of any of the above embodiments by reading and executing computer program instructions stored in memory 404.
Optionally, the electronic apparatus may further include a transmission device 406 and an input/output device 408, where the transmission device 406 is connected to the processor 402 and the input/output device 408 is connected to the processor 402.
The transmission device 406 may be used to receive or transmit data via a network. Specific examples of the network described above may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through the base station to communicate with the internet. In one example, the transmission device 406 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
The input-output device 408 is used to input or output information. In this embodiment, the input information may be a video frame sequence including a water body, and the output information may be a water body segmentation result, and the like.
Alternatively, in the present embodiment, the above-mentioned processor 402 may be configured to execute the following steps by a computer program:
S101, acquiring at least one video frame sequence containing a water body as a training sample, labeling the water body of the initial frame in each video frame sequence, and inputting the video frame sequence into the constructed video water body segmentation architecture;
S102, the video water body segmentation architecture comprises a scene coding module, a space-time feature aggregation module and a mask self-encoding-decoding module, wherein the scene coding module encodes each frame in the video frame sequence to obtain a multi-scale image corresponding to each frame;
S103, the space-time feature aggregation module takes the input current frame as a query, takes the multi-scale image corresponding to the previous frame of the current frame as a key, and takes the predicted segmentation result corresponding to the previous frame of the current frame as a value to perform attention calculation, obtaining a rough water body segmentation map of the current frame; if the previous frame of the current frame is the initial frame, the water body labeling information of the initial frame is used as the predicted segmentation result of the initial frame;
S104, the mask self-encoding-decoding module obtains the water body features of the current frame based on the rough water body segmentation map of the current frame, the multi-scale image corresponding to the current frame, and the current frame itself, and progressively fuses the water body features of the current frame with the multi-scale image corresponding to the current frame to obtain a fine water body segmentation map of the current frame; the fine water body segmentation map of the current frame is then taken as the predicted segmentation result of the next frame to obtain the fine water body segmentation map of the next frame, until a fine water body segmentation map is obtained for every frame except the initial frame;
S105, the space-time feature aggregation module takes the initial frame as a query, takes a set number of fine water body segmentation maps as keys, and takes the multi-scale images corresponding to those keys as values to perform attention calculation, obtaining a rough water body segmentation map of the initial frame; the mask self-encoding-decoding module refines the rough water body segmentation map of the initial frame to obtain a fine water body segmentation map of the initial frame;
S106, constructing a loss function by using the fine water body segmentation map of the initial frame and the water body labeling information of the initial frame, and retaining parameters of the current video water body segmentation architecture when the loss function meets set conditions to obtain a video water body segmentation model.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and this embodiment is not repeated herein.
In general, the various embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of a mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products) including software routines, applets, and/or macros can be stored in any apparatus-readable data storage medium and they include program instructions for performing particular tasks. The computer program product may include one or more computer-executable components configured to perform embodiments when the program is run. The one or more computer-executable components may be at least one software code or a portion thereof. In this regard, it should also be noted that any block of the logic flow as in fig. 6 may represent a program step, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on a physical medium such as a memory chip or memory block implemented within a processor, a magnetic medium such as a hard disk or floppy disk, and an optical medium such as, for example, a DVD and its data variants, a CD, etc. The physical medium is a non-transitory medium.
It should be understood by those skilled in the art that the technical features of the above embodiments may be combined in any manner, and for brevity, all of the possible combinations of the technical features of the above embodiments are not described, however, they should be considered as being within the scope of the description provided herein, as long as there is no contradiction between the combinations of the technical features.
The foregoing examples illustrate only a few embodiments of the application, which are described in greater detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.
Claims (10)
1. A method for constructing a video water body segmentation model based on a mask self-encoder, characterized by comprising the following steps:
Acquiring at least one video frame sequence containing a water body as a training sample, labeling the water body of the initial frame in each video frame sequence, and inputting the video frame sequence into a constructed video water body segmentation architecture;
The video water body segmentation architecture comprises a scene coding module, a space-time feature aggregation module and a mask self-encoding-decoding module, wherein the scene coding module encodes each frame in the video frame sequence to obtain a multi-scale image corresponding to each frame;
The space-time feature aggregation module takes the input current frame as a query, takes the multi-scale image corresponding to the previous frame of the current frame as a key, and takes the predicted segmentation result corresponding to the previous frame of the current frame as a value to perform attention calculation, obtaining a rough water body segmentation map of the current frame; if the previous frame of the current frame is the initial frame, the water body labeling information of the initial frame is used as the predicted segmentation result of the initial frame;
The mask self-encoding-decoding module obtains the water body features of the current frame based on the rough water body segmentation map of the current frame, the multi-scale image corresponding to the current frame, and the current frame itself, and progressively fuses the water body features of the current frame with the multi-scale image corresponding to the current frame to obtain a fine water body segmentation map of the current frame; the fine water body segmentation map of the current frame is then taken as the predicted segmentation result of the next frame to obtain the fine water body segmentation map of the next frame, until a fine water body segmentation map is obtained for every frame except the initial frame;
The space-time feature aggregation module then takes the initial frame as a query, takes a set number of fine water body segmentation maps as keys, and takes the multi-scale images corresponding to those keys as values to perform attention calculation, obtaining a rough water body segmentation map of the initial frame; the mask self-encoding-decoding module refines the rough water body segmentation map of the initial frame to obtain a fine water body segmentation map of the initial frame;
And constructing a loss function by using the fine water body segmentation map of the initial frame and the water body labeling information of the initial frame, and retaining parameters of the current video water body segmentation architecture when the loss function meets set conditions to obtain a video water body segmentation model.
2. The method according to claim 1, wherein in the step of encoding each frame in the video frame sequence by the scene coding module to obtain a multi-scale image corresponding to each frame, the scene encoder consists of a convolution layer followed by a plurality of residual structures, and the multi-scale image comprises feature maps of different sizes output by each residual structure.
3. The method according to claim 1, wherein in the step of "taking as a key a multi-scale image corresponding to a frame immediately preceding a current frame", a feature map with a minimum resolution in the multi-scale image corresponding to the frame immediately preceding the current frame is taken as a key.
4. The method for constructing a mask-based self-encoder video water segmentation model according to claim 1, wherein the prediction segmentation result is downsampled to the same size as a feature map with minimum resolution in a multi-scale image and then used as a value.
5. The method for constructing a video water body segmentation model based on a mask self-encoder according to claim 1, wherein the mask self-encoding-decoding module comprises a mask encoding unit and a mask decoding unit; the mask encoding unit is formed by concatenating an encoding convolution layer, a plurality of encoding residual blocks and a feature encoding fusion module; the current frame and the rough water body segmentation map of the current frame are spliced and then input into the mask encoding unit, and a first feature map is output after passing through the encoding convolution layer and the encoding residual blocks; the size of the first feature map is the same as that of the feature map with the minimum resolution in the multi-scale image, and the number of encoding residual blocks is the same as the number of residual blocks in the scene coding module; and the feature encoding fusion module fuses the first feature map with the minimum-resolution feature map of the multi-scale image using spatial attention and channel attention operations to obtain a first water body feature map.
6. The method for constructing a video water body segmentation model based on a mask self-encoder according to claim 5, wherein the mask decoding unit is formed by concatenating a feature decoding fusion unit, a plurality of up-sampling layers and a classification head; the feature decoding fusion unit has the same structure as the feature encoding fusion module and again fuses the first water body feature map with the minimum-resolution feature map of the multi-scale image using spatial attention and channel attention operations to obtain a second water body feature map; the up-sampling layers progressively up-sample the second water body feature map, and after each up-sampling the result is fused with the feature map of the same size in the multi-scale image to obtain a third water body feature map; and the classification head processes the third water body feature map with a bilinear-interpolation up-sampling operation to output the fine water body segmentation map of the current frame.
7. A method for video water segmentation, comprising:
Obtaining a video frame sequence to be segmented, and inputting the video frame sequence to be segmented into the video water body segmentation model trained according to claim 1 to obtain a water body segmentation result.
8. A device for constructing a video water body segmentation model based on a mask self-encoder, characterized by comprising:
The acquisition module is used for acquiring at least one video frame sequence containing a water body as a training sample, labeling the water body of the initial frame in each video frame sequence, and inputting the video frame sequence into the constructed video water body segmentation architecture;
the video water body segmentation architecture comprises a scene coding module, a space-time feature aggregation module and a mask self-encoding-decoding module, wherein the scene coding module encodes each frame in the video frame sequence to obtain a multi-scale image corresponding to each frame;
The space-time feature aggregation module takes the input current frame as a query, takes the multi-scale image corresponding to the previous frame of the current frame as a key, and takes the predicted segmentation result corresponding to the previous frame of the current frame as a value to perform attention calculation, obtaining a rough water body segmentation map of the current frame; if the previous frame of the current frame is the initial frame, the water body labeling information of the initial frame is used as the predicted segmentation result of the initial frame;
The mask self-encoding-decoding module obtains the water body features of the current frame based on the rough water body segmentation map of the current frame, the multi-scale image corresponding to the current frame, and the current frame itself, and progressively fuses the water body features of the current frame with the multi-scale image corresponding to the current frame to obtain a fine water body segmentation map of the current frame; the fine water body segmentation map of the current frame is then taken as the predicted segmentation result of the next frame to obtain the fine water body segmentation map of the next frame, until a fine water body segmentation map is obtained for every frame except the initial frame;
the initial frame prediction module is used for taking the initial frame as a query in the space-time feature aggregation module, taking a set number of fine water body segmentation maps as keys and the multi-scale images corresponding to those keys as values, performing attention calculation to obtain a rough water body segmentation map of the initial frame, and refining the rough water body segmentation map of the initial frame with the mask self-encoding-decoding module to obtain a fine water body segmentation map of the initial frame;
The loss calculation module is used for constructing a loss function by using the fine water body segmentation map of the initial frame and the water body labeling information of the initial frame, and when the loss function meets the set condition, the parameters of the current video water body segmentation architecture are reserved to obtain a video water body segmentation model.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being arranged to run the computer program to perform a method of constructing a mask-based self-encoder video water segmentation model as claimed in any one of claims 1-6 or a method of video water segmentation as claimed in claim 7.
10. A readable storage medium, wherein a computer program is stored in the readable storage medium, the computer program comprising program code for controlling a process to perform a process comprising a method of constructing a mask-based self-encoder video water segmentation model according to any one of claims 1-6 or a video water segmentation method according to claim 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410823092.7A CN118397516B (en) | 2024-06-25 | 2024-06-25 | Method and device for constructing video water body segmentation model based on mask self-encoder |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118397516A (en) | 2024-07-26
CN118397516B (en) | 2024-08-23
Family
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410823092.7A (CN118397516B, Active) | Method and device for constructing video water body segmentation model based on mask self-encoder | 2024-06-25 | 2024-06-25
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118397516B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109685802A (en) * | 2018-12-13 | 2019-04-26 | 贵州火星探索科技有限公司 | A kind of Video segmentation live preview method of low latency |
CN111968123A (en) * | 2020-08-28 | 2020-11-20 | 北京交通大学 | Semi-supervised video target segmentation method |
US20220148284A1 (en) * | 2020-11-12 | 2022-05-12 | The Board of Trustees of the University of Illinois (Urbana, IL) | Segmentation method and segmentation apparatus |
JP2023535672A (en) * | 2021-06-30 | 2023-08-21 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | OBJECT SEGMENTATION METHOD, OBJECT SEGMENTATION APPARATUS, AND ELECTRONIC DEVICE |
CN116993974A (en) * | 2023-06-21 | 2023-11-03 | 大连海洋大学 | Multi-source information guided underwater video fish segmentation method |
Non-Patent Citations (2)
Title |
---|
- Fu Lihua et al., "Semi-supervised video object segmentation based on attention correction", Journal of Beijing University of Technology, 31 August 2022 (2022-08-31) *
- Chen Hao; Qin Zhiguang; Ding Yi, "A two-stage coarse-to-fine multi-modal brain tumor segmentation framework", Journal of University of Electronic Science and Technology of China, no. 04, 30 July 2020 (2020-07-30) *
Also Published As
Publication number | Publication date |
---|---|
CN118397516B (en) | 2024-08-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |