CN118747797B - Image processing method and device based on deep learning - Google Patents
- Publication number
- CN118747797B CN118747797B CN202411098789.9A CN202411098789A CN118747797B CN 118747797 B CN118747797 B CN 118747797B CN 202411098789 A CN202411098789 A CN 202411098789A CN 118747797 B CN118747797 B CN 118747797B
- Authority
- CN
- China
- Prior art keywords
- information
- image
- target detection
- natural image
- mask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Image Analysis (AREA)
Abstract
The invention provides an image processing method and device based on deep learning. The method comprises the following steps: acquiring image data to be detected and fusing it with input prompt information to obtain input data; and inputting the input data into a pre-constructed target detection model to obtain the target detection result output by that model. The target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the prompt information corresponding to the natural image samples, and label information. The method solves the technical problems in the prior art that target detection lacks interactive capability during image processing and that the image processing effect is therefore limited.
Description
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image processing method and apparatus based on deep learning.
Background
Target detection is one of the important means of image processing. In the medical field, for example, conventional target detection adopts either a two-stage detection model such as RCNN or a one-stage detection model such as YOLO. Two-stage models such as RCNN generate candidate regions in the first stage, then classify each candidate region and perform bounding-box regression in the second stage; one-stage models such as YOLO predict classes and boxes in a single stage. In either case, the prior art lacks interactive capability when detecting image targets: it cannot perform detection driven by a prompt, which limits both the effect and the application range of image processing.
Disclosure of Invention
The invention provides an image processing method and device based on deep learning, which are used to solve the technical problems in the prior art that target detection lacks interactive capability and the image processing effect is limited.
The invention provides an image processing method based on deep learning, which comprises the following steps:
acquiring image data to be detected, and fusing the image data with input prompt information to obtain input data;
inputting the input data into a pre-constructed target detection model to obtain a target detection result output by the target detection model;
The target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the prompt information corresponding to the natural image samples, and label information.
In some embodiments, the target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the prompt information corresponding to the natural image samples, and label information, which specifically includes:
constructing a data set, wherein the data set comprises a natural image sample, prompt information corresponding to the natural image sample and label information;
dividing the data set into a training set, a verification set and a test set;
Inputting samples in the training set into a pre-constructed deep learning network for training to obtain an initial detection model;
And respectively verifying and testing the initial detection model by using a verification set and a test set to obtain the target detection model.
In some embodiments, the constructing the data set specifically includes:
collecting a large number of natural image samples to establish a gallery;
generating a mask image of equal resolution from each natural image sample to establish a mask library;
labeling a target to be labeled in a non-mask area on a natural image sample to generate label information;
The natural image sample, its corresponding mask image, and the label information of its non-mask area are taken as one unit, to construct a data set composed of multiple such units.
In some embodiments, the network architecture of the deep learning network comprises:
The image encoder is used for extracting semantic information from an input natural image sample and obtaining a characteristic diagram of the natural image sample through multiple downsampling;
The prompt information encoder is used for extracting semantic information from the input prompt information, and obtaining a feature map fused with the prompt information through information fusion and processing;
The feature fusion module is used for carrying out feature fusion on the feature graphs with multiple sizes;
and the label matching module, used for determining the label information of each sample by means of a minimum-cost loss and for establishing the relation between image information and text description by means of contrastive learning.
In some embodiments, the hint information includes mask information and text information, and the hint information encoder includes:
The mask encoder is used for extracting semantic information from an input mask image and obtaining a feature map of the mask image through multiple downsampling;
The fusion layer is used for fusing the feature images of the natural image sample with the feature images of the mask image;
The text encoder is used for extracting high-level semantic information from the input text information, and for establishing the association between image features and text features through contrastive learning.
In some embodiments, the feature fusion module includes co-scale feature fusion and cross-scale feature fusion.
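The encoder-and-fusion pipeline of the embodiments above can be sketched structurally as below. This is a minimal illustration, not the patented implementation: the encoders are random stand-ins for the MAE-pretrained ViT and COCO-pretrained ResNet named later in the description, and the channel width `C = 256` and the fully connected weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 256  # channel width (illustrative assumption)

def encode_image(img, strides=(8, 16, 32)):
    """Stand-in for the image encoder: one feature map per downsampling level."""
    h, w, _ = img.shape
    return [rng.standard_normal((h // s, w // s, C)) for s in strides]

def encode_mask(mask, strides=(8, 16, 32)):
    """Stand-in for the mask encoder (same pyramid levels as the image encoder)."""
    h, w = mask.shape
    return [rng.standard_normal((h // s, w // s, C)) for s in strides]

def fuse(img_feats, mask_feats, W):
    """Fusion layer: concatenate image and mask features per level, then project
    back to C channels with a fully connected layer."""
    return [np.concatenate([i, m], axis=-1) @ W for i, m in zip(img_feats, mask_feats)]

img = rng.standard_normal((256, 256, 3))   # natural image sample
mask = np.zeros((256, 256))                # single-channel gray mask, equal resolution
W = rng.standard_normal((2 * C, C))        # FC weights: 2C -> C

fused = fuse(encode_image(img), encode_mask(mask), W)
print([f.shape for f in fused])            # [(32, 32, 256), (16, 16, 256), (8, 8, 256)]
```

The fused maps would then go to the feature fusion and label matching modules; those stages are sketched separately below only in spirit, since the patent gives their behavior rather than code.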
The invention also provides an image processing device based on deep learning, which comprises:
The data acquisition unit is used for acquiring image data to be detected, and fusing the image data with the input prompt information to obtain input data;
The result generation unit is used for inputting the input data into a pre-constructed target detection model so as to obtain a target detection result output by the target detection model;
The target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the prompt information corresponding to the natural image samples, and label information.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
According to the image processing method and device based on deep learning provided by the invention, the image data to be detected is acquired and fused with the input prompt information to obtain the input data; the input data is fed into a pre-constructed target detection model to obtain the target detection result it outputs; and the target detection model is trained on a pre-constructed deep learning network using natural image samples, the prompt information corresponding to those samples, and label information. By fusing prompt information into model training, the method and device can significantly improve the ability of the detection model to interact with human beings on the basis of prompts. This has practical value and solves the prior-art problems that target detection lacks interactive capability during image processing and that the image processing effect is limited.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is one of the flowcharts of the image processing method based on deep learning provided by the present invention;
FIG. 2 is a second flowchart of an image processing method based on deep learning according to the present invention;
FIG. 3 is a third flowchart of an image processing method based on deep learning according to the present invention;
fig. 4 is a block diagram of the image processing apparatus based on deep learning provided by the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
With the rapid development of deep learning networks, the performance of target detection models has greatly improved. However, as demand grows for interactive artificial intelligence systems, the target detection field has seen little research on prompt-driven visual understanding tasks. Recently, the success of large language models has demonstrated the importance of human interaction for modern artificial intelligence models; interactive segmentation models provide one example of research on visual understanding tasks; some target detection techniques avoid manually designed components such as anchor definition, label assignment, and prediction-box post-processing, completing the detection task end to end; and other multimodal models associate text features with image features. Accordingly, the invention provides an interactive target detection method based on deep learning: detection of the prompted target is realized through prompts such as points, boxes, texts, and masks, improving the interaction capability of the model and enhancing its practical value.
In a specific embodiment, the image processing method based on deep learning provided by the invention comprises the following steps:
S110: acquiring image data to be detected, and fusing the image data with input prompt information to obtain input data;
s120: inputting the input data into a pre-constructed target detection model to obtain a target detection result output by the target detection model;
The target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the prompt information corresponding to the natural image samples, and label information.
In some embodiments, as shown in fig. 2, training is performed based on a pre-constructed deep learning network by using a natural image sample, prompt information corresponding to the natural image sample, and tag information, so as to obtain the target detection model, and specifically includes the following steps:
S210: constructing a data set, wherein the data set comprises a natural image sample, prompt information corresponding to the natural image sample and label information;
S220: dividing the data set into a training set, a verification set and a test set;
S230: inputting samples in the training set into a pre-constructed deep learning network for training to obtain an initial detection model. Network model: based on the DINO framework, mask-encoding and text-encoding branches are introduced to improve the model's ability to interact through prompts such as points, boxes, texts, and masks;
s240: and respectively verifying and testing the initial detection model by using a verification set and a test set to obtain the target detection model.
In step S210, the constructing the data set specifically includes:
collecting a large number of natural image samples to establish a gallery;
generating a mask image of equal resolution from each natural image sample to establish a mask library;
labeling a target to be labeled in a non-mask area on a natural image sample to generate label information;
The natural image sample, its corresponding mask image, and the label information of its non-mask area are taken as one unit, to construct a data set composed of multiple such units.
Specifically, in the process of constructing the data set, a large number of natural images are first collected (including natural images contained in data sets such as COCO and VOC) to establish a gallery; for each natural image, a mask image of equal resolution, stored as a single-channel gray image, is generated to establish a mask library; the targets to be labeled in the non-mask area of each natural image are annotated (with category, bounding box, and a high-quality text description) to generate the label information; and each natural image, its corresponding mask image, and the label information of its non-mask area are taken as one basic unit, the units being divided into a training set, a verification set, and a test set at an 8:1:1 ratio.
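The unit construction and 8:1:1 division described above can be sketched as follows; `build_dataset` and `split_8_1_1` are hypothetical helper names, and the shuffle seed is an arbitrary choice.

```python
import random

def build_dataset(images, masks, labels):
    """One basic unit = (natural image, its equal-resolution mask image,
    label information of the non-mask area)."""
    return list(zip(images, masks, labels))

def split_8_1_1(units, seed=42):
    """Shuffle, then divide the units into training / verification / test sets
    at an 8:1:1 ratio."""
    units = units[:]
    random.Random(seed).shuffle(units)
    n_train = int(len(units) * 0.8)
    n_val = int(len(units) * 0.1)
    return units[:n_train], units[n_train:n_train + n_val], units[n_train + n_val:]

units = build_dataset(range(100), range(100), range(100))
train, val, test = split_8_1_1(units)
print(len(train), len(val), len(test))  # 80 10 10
```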
In some embodiments, the network architecture of the deep learning network comprises:
The image encoder is used for extracting semantic information from the input natural image sample and obtaining feature maps of the sample through repeated downsampling. Specifically, an MAE-pretrained Vision Transformer (ViT) is used as the image encoder to extract the high-level semantic information of the input natural image, yielding feature maps at 8×, 16×, and 32× downsampling of the input. The feature maps of the different levels can be written abstractly as $\{I_i\}_{i=1}^{L}$, where $L$ is the number of feature-map levels (here $L = 3$);
The prompt information encoder is used for extracting semantic information from the input prompt information, and obtaining a feature map fused with the prompt information through information fusion and processing;
The feature fusion module is used for fusing feature maps of multiple sizes, and comprises co-scale feature fusion and cross-scale feature fusion. It aims to improve the model's ability to detect targets of different sizes through multi-size feature fusion. Specifically, to reduce the computation of the model, and considering that low-level feature maps lack semantic information and risk being confounded with the semantic information of high-level feature maps, the feature fusion module is decoupled into co-scale feature fusion and cross-scale feature fusion. Co-scale feature fusion can be expressed as:

$$F_i' = \mathrm{Reshape}\left(\mathrm{MHSA}\left(Q_i, K_i, V_i\right)\right), \qquad Q_i = K_i = V_i = \mathrm{Flatten}(F_i)$$

where $F_i$ denotes the $i$-th level feature map after the fusion of natural-image and mask-image features; $Q_i$, $K_i$, and $V_i$ denote the query, key, and value corresponding to that feature map; $\mathrm{Flatten}$ denotes the flattening operation; $\mathrm{MHSA}$ denotes multi-head self-attention; and $\mathrm{Reshape}$ restores the dimensions prior to the $\mathrm{Flatten}$ operation;
cross-scale feature fusion adopts a feature pyramid structure; the top-down and bottom-up fusion paths enrich the position information of the high-level feature maps and the semantic information of the low-level feature maps, improving the model's ability to detect targets of different sizes.
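The Flatten, self-attend, Reshape pattern of co-scale fusion can be illustrated on one pyramid level as below; a single attention head stands in for the multi-head self-attention, and the feature sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_scale_fusion(F):
    """Co-scale fusion on one level: Flatten the (H, W, C) map, self-attend
    with Q = K = V (one head stands in for MHSA here), then Reshape back to
    the dimensions before the Flatten operation."""
    h, w, c = F.shape
    x = F.reshape(h * w, c)               # Flatten
    attn = softmax(x @ x.T / np.sqrt(c))  # attention weights over all positions
    return (attn @ x).reshape(h, w, c)    # attend, then Reshape

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8, 16))       # one fused image+mask feature map
out = co_scale_fusion(F)
print(out.shape)  # (8, 8, 16)
```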
The label matching module is used for determining the label information of each sample by means of a minimum-cost loss, establishing the classification and regression losses, and so realizing target detection. In addition, to further improve the interactive capability of the model, contrastive learning is used to establish a relationship between image information and text descriptions, which can be expressed as:

$$\mathcal{L}_v = -\frac{1}{|\mathcal{I}|}\sum_{i \in \mathcal{I}} \frac{1}{|\mathcal{T}_i|}\sum_{t \in \mathcal{T}_i} \log \frac{\exp\left(\mathrm{sim}(i, t)\right)}{\sum_{t' \in \mathcal{T}} \exp\left(\mathrm{sim}(i, t')\right)}, \qquad \mathcal{L}_t = -\frac{1}{|\mathcal{T}|}\sum_{t \in \mathcal{T}} \frac{1}{|\mathcal{I}_t|}\sum_{i \in \mathcal{I}_t} \log \frac{\exp\left(\mathrm{sim}(i, t)\right)}{\sum_{i' \in \mathcal{I}} \exp\left(\mathrm{sim}(i', t)\right)}, \qquad \mathcal{L} = \frac{\mathcal{L}_v + \mathcal{L}_t}{2}$$

where $\mathcal{L}_v$ and $\mathcal{L}_t$ denote the losses from the vision and text perspectives respectively, and $\mathcal{L}$ the total loss; $\mathcal{T}$ denotes the set of text features within a batch and $\mathcal{I}$ the set of image features within a batch; $\mathcal{T}_i$ and $\mathcal{I}_t$ are subsets of $\mathcal{T}$ and $\mathcal{I}$: $\mathcal{T}_i$ is the set of text features whose category label matches that of image feature $i$, and $\mathcal{I}_t$ the set of image features whose category label matches that of text feature $t$; and $\mathrm{sim}(i, t)$ denotes the cosine similarity of an image feature and a text feature.
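A sketch of the label-aware contrastive objective described above, assuming the standard softmax-over-similarities form (temperature scaling omitted for brevity); the feature values and category labels are synthetic.

```python
import numpy as np

def contrastive_loss(img_feats, txt_feats, labels):
    """Symmetric contrastive loss over a batch: for image feature i, the
    positives are the text features sharing its category label (and vice
    versa); the total loss averages the vision-side and text-side terms."""
    I = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    T = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    sim = I @ T.T                             # cosine similarities
    pos = labels[:, None] == labels[None, :]  # same-category (image, text) pairs
    logp_v = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))  # image -> texts
    logp_t = sim - np.log(np.exp(sim).sum(axis=0, keepdims=True))  # text -> images
    return (-logp_v[pos].mean() - logp_t[pos].mean()) / 2

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1])
loss = contrastive_loss(rng.standard_normal((4, 8)), rng.standard_normal((4, 8)), labels)
print(loss)
```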
In some embodiments, the hint information includes mask information and text information, and the hint information encoder includes:
The mask encoder is used for extracting semantic information from the input mask image and obtaining feature maps of the mask image through repeated downsampling. Specifically, a COCO-pretrained ResNet is used as the mask encoder to extract the high-level semantic information of the input mask image, yielding feature maps at 8×, 16×, and 32× downsampling of the input. The feature maps of the different levels can be written abstractly as $\{M_i\}_{i=1}^{L}$, with $L$ the number of levels as above. The fusion of the natural-image features and the mask-image features can be expressed as:

$$F_i = \mathrm{FC}\left(\mathrm{Concat}\left(I_i, M_i\right)\right)$$

where $\mathrm{FC}$ denotes a fully connected layer and $F_i$ denotes the $i$-th level feature map after fusion, with $i = 1, \ldots, L$;
The fusion layer is used for fusing the feature images of the natural image sample with the feature images of the mask image;
The text encoder is used for extracting high-level semantic information from the input text information, and for establishing the association between image features and text features through contrastive learning.
The text encoder adopts a CLIP-pretrained text encoder to extract the high-level semantic information of the input text, aiming to establish, through contrastive learning, the connection between image features and text features and so improve the model's ability to locate a target from a text prompt. Specifically, a batch of natural images $\{x_n\}_{n=1}^{N}$ is first randomly sampled together with the corresponding mask images $\{m_n\}_{n=1}^{N}$ and the text descriptions $\{t_n^{j}\}$ in the label information, where $t_n^{j}$ denotes the text description of the $j$-th object in the $n$-th natural image and $N$ denotes the batch size. The natural image $x_n$, mask image $m_n$, and text description $t_n^{j}$ are then fed to the image encoder $E_{\mathrm{img}}$, mask encoder $E_{\mathrm{mask}}$, and text encoder $E_{\mathrm{text}}$ respectively, and the image, mask, and text features are obtained as:

$$I_n^{i} = E_{\mathrm{img}}(x_n), \qquad M_n^{i} = E_{\mathrm{mask}}(m_n), \qquad T_n^{j} = E_{\mathrm{text}}(t_n^{j})$$

where $I_n^{i}$ and $M_n^{i}$ denote the $i$-th level feature maps corresponding to natural image $x_n$ and mask image $m_n$, whose channel number is $C$, and $T_n^{j}$ denotes the text feature of the description of the $j$-th object in the $n$-th natural image, of dimension $C$.

The fusion of the natural-image features and the mask-image features can be expressed as:

$$F_n^{i} = \mathrm{FC}\left(\mathrm{Concat}\left(I_n^{i}, M_n^{i}\right)\right)$$

where $F_n^{i}$ denotes the $i$-th level feature map after the fusion of natural-image and mask-image features, whose channel number is likewise $C$.
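The shape bookkeeping of this fusion can be checked with a small numerical sketch; the batch size, spatial size, and channel number `C` are illustrative, and the features are random stand-ins for the encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 4, 32                                 # batch size and channel number (illustrative)

# Random stand-ins for one pyramid level of encoder outputs:
I_feats = rng.standard_normal((N, 8, 8, C))  # image features I_n
M_feats = rng.standard_normal((N, 8, 8, C))  # mask features M_n
T_feats = rng.standard_normal((N, C))        # one text feature T_n per object

# F_n = FC(Concat(I_n, M_n)): the fully connected layer maps 2C channels back to C,
# so the fused map matches the text-feature dimension C for contrastive learning.
W = rng.standard_normal((2 * C, C))
F_feats = np.concatenate([I_feats, M_feats], axis=-1) @ W
print(F_feats.shape)  # (4, 8, 8, 32)
```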
During model optimization, a deep learning workstation was established running Ubuntu 20.04, with hardware comprising an Intel Core i9 processor, 64 GB of memory, and one NVIDIA GeForce RTX 3090. The configuration file of the model is based on the DINO structure: a contrastive denoising module is retained, which, through the matching of positive and negative queries, avoids the unstable label matching of the Hungarian algorithm early in training; and a cross-attention module based on multi-scale deformable convolution is introduced into the decoder, improving the model's ability to acquire relevant content information from position information.
In an actual usage scenario, as shown in fig. 3, the DINO structure facilitates the introduction of prompts such as points, boxes, texts, and masks, realizing detection of the prompted target. Specifically, at test time the query positions provide an interface for initialization with physical size information such as points and boxes, and the image features of interest are obtained through the cross-attention module on the basis of that position information; the mask encoder provides the objective conditions for introducing mask information, and the supervised learning of the training stage teaches the model to automatically shield the image features of the mask region, so that at test time the queries obtain only the image features of interest in the non-mask region; and the text encoder relies on contrastive learning to establish the link between text features and image features.
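The mask-prompt behavior at test time amounts to suppressing the mask region's contribution. A crude functional stand-in is shown below; the real model learns this suppression through supervised training rather than hard zeroing, so `apply_mask_prompt` is only an assumed approximation of the effect.

```python
import numpy as np

def apply_mask_prompt(feats, mask):
    """Zero the image features wherever the (already downsampled) mask is
    active, so queries can only pick up features from the non-mask region."""
    return feats * (1.0 - mask)[..., None]

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 8, 16))   # image feature map at one level
mask = np.zeros((8, 8))
mask[:4, :] = 1.0                         # mask out the top half of the image

out = apply_mask_prompt(feats, mask)
print(np.abs(out[:4]).sum() == 0.0)       # True: masked region contributes nothing
```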
In the above specific embodiment, according to the image processing method based on deep learning provided by the invention, the image data to be detected is acquired and fused with the input prompt information to obtain the input data; the input data is fed into a pre-constructed target detection model to obtain the target detection result it outputs; and the target detection model is trained on a pre-constructed deep learning network using natural image samples, the prompt information corresponding to those samples, and label information. By fusing prompt information into model training, the method can significantly improve the ability of the detection model to interact with human beings on the basis of prompts. This has practical value and solves the prior-art problems that target detection lacks interactive capability during image processing and that the image processing effect is limited.
In addition to the above method, the present invention also provides an image processing apparatus based on deep learning, as shown in fig. 4, the apparatus comprising:
the data acquisition unit 410 is configured to acquire image data to be detected, and fuse the image data with the input prompt information to obtain input data;
A result generating unit 420, configured to input the input data into a pre-constructed target detection model, so as to obtain a target detection result output by the target detection model;
The target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the prompt information corresponding to the natural image samples, and label information.
In some embodiments, the target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the mask image samples corresponding to the natural image samples, and label information, which specifically includes:
constructing a data set, wherein the data set comprises a natural image sample, prompt information corresponding to the natural image sample and label information;
dividing the data set into a training set, a verification set and a test set;
Inputting samples in the training set into a pre-constructed deep learning network for training to obtain an initial detection model;
And respectively verifying and testing the initial detection model by using a verification set and a test set to obtain the target detection model.
In some embodiments, the constructing the data set specifically includes:
collecting a large number of natural image samples to establish a gallery;
generating a mask image of equal resolution from each natural image sample to establish a mask library;
labeling a target to be labeled in a non-mask area on a natural image sample to generate label information;
The natural image sample, its corresponding mask image, and the label information of its non-mask area are taken as one unit, to construct a data set composed of multiple such units.
In some embodiments, the network architecture of the deep learning network comprises:
The image encoder is used for extracting semantic information from an input natural image sample and obtaining a characteristic diagram of the natural image sample through multiple downsampling;
The prompt information encoder is used for extracting semantic information from the input prompt information, and obtaining a feature map fused with the prompt information through information fusion and processing;
The feature fusion module is used for carrying out feature fusion on the feature graphs with multiple sizes;
and the label matching module, used for determining the label information of each sample by means of a minimum-cost loss and for establishing the relation between image information and text description by means of contrastive learning.
In some embodiments, the hint information includes mask information and text information, and the hint information encoder includes:
The mask encoder is used for extracting semantic information from an input mask image and obtaining a feature map of the mask image through multiple downsampling;
The fusion layer is used for fusing the feature images of the natural image sample with the feature images of the mask image;
The text encoder is used for extracting high-level semantic information from the input text information, and for establishing the association between image features and text features through contrastive learning.
In some embodiments, the feature fusion module includes co-scale feature fusion and cross-scale feature fusion.
In the above specific embodiment, the image processing device based on deep learning provided by the invention acquires the image data to be detected and fuses it with the input prompt information to obtain the input data; feeds the input data into a pre-constructed target detection model to obtain the target detection result it outputs; and uses a target detection model trained on a pre-constructed deep learning network with natural image samples, the prompt information corresponding to those samples, and label information. By fusing prompt information into model training, the device can significantly improve the ability of the detection model to interact with human beings on the basis of prompts. This has practical value and solves the prior-art problems that target detection lacks interactive capability during image processing and that the image processing effect is limited.
Fig. 5 illustrates a physical schematic diagram of an electronic device, as shown in fig. 5, which may include: processor 510, communication interface (Communications Interface) 520, memory 530, and communication bus 540, wherein processor 510, communication interface 520, memory 530 complete communication with each other through communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform the methods described above.
Further, the logic instructions in the memory 530 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium and that, when executed by a processor, performs the methods described above.
The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, or the part of it that contributes over the prior art, may in essence be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in each embodiment or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (7)
1. An image processing method based on deep learning, the method comprising:
acquiring image data to be detected, and fusing the image data with input prompt information to obtain input data;
inputting the input data into a pre-constructed target detection model to obtain a target detection result output by the target detection model;
the target detection model is obtained by training a natural image sample, prompt information corresponding to the natural image sample and label information based on a pre-constructed deep learning network;
the network architecture of the deep learning network comprises:
an image encoder for extracting semantic information from an input natural image sample and obtaining feature maps of the natural image sample through multiple rounds of downsampling;
a prompt information encoder for extracting semantic information from input prompt information and obtaining a feature map fused with the prompt information through information fusion and processing;
a feature fusion module for performing feature fusion on feature maps of multiple sizes; and
a label matching module for determining label information of a sample using a minimum-cost loss and establishing a relation between image information and text description using contrastive learning;
wherein the prompt information includes mask information and text information, and the prompt information encoder includes:
a mask encoder for extracting semantic information from an input mask image and obtaining feature maps of the mask image through multiple rounds of downsampling;
a fusion layer for fusing the feature maps of the natural image sample with the feature maps of the mask image; and
a text encoder for extracting high-level semantic information from the input text information and establishing an association between image features and text features through contrastive learning;
and wherein the feature fusion module performs both co-scale feature fusion and cross-scale feature fusion.
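The encoder and fusion elements of claim 1 can be sketched as follows. This is a minimal NumPy illustration, not the patented network: downsampling is assumed to be 2x average pooling, and co-scale fusion is assumed to be elementwise addition of same-size feature maps; the claim specifies neither, and every function name here is hypothetical.

```python
import numpy as np

def downsample2x(feat):
    """One downsampling step of the image/mask encoder, assumed here
    to be 2x average pooling over non-overlapping blocks."""
    h, w = feat.shape[0] // 2 * 2, feat.shape[1] // 2 * 2
    f = feat[:h, :w]
    return (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2]) / 4.0

def encode(image, steps=3):
    """Encoder: repeated downsampling yields feature maps of multiple
    sizes (a pyramid), as both the image and mask encoders produce."""
    pyramid = [image]
    for _ in range(steps):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid

def fuse(img_pyramid, mask_pyramid):
    """Fusion layer: combine natural-image and mask-image feature maps
    at each matching scale (co-scale fusion, assumed elementwise add)."""
    return [i + m for i, m in zip(img_pyramid, mask_pyramid)]

img = np.random.rand(32, 32)
msk = np.zeros((32, 32))        # all-zero mask prompt for illustration
fused = fuse(encode(img), encode(msk))
```

With a 32x32 input and three downsampling steps, the fused pyramid holds feature maps at sizes 32, 16, 8, and 4; an all-zero mask contributes nothing, so the fused maps equal the image maps.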
2. The deep-learning-based image processing method according to claim 1, wherein training based on the pre-constructed deep learning network using the natural image sample, the prompt information corresponding to the natural image sample, and the label information to obtain the target detection model specifically comprises:
constructing a data set, wherein the data set comprises a natural image sample, prompt information corresponding to the natural image sample and label information;
dividing the data set into a training set, a verification set and a test set;
Inputting samples in the training set into a pre-constructed deep learning network for training to obtain an initial detection model;
And respectively verifying and testing the initial detection model by using a verification set and a test set to obtain the target detection model.
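The training pipeline of claim 2 (build a dataset, split it into training, verification, and test sets, then train and evaluate) can be sketched as a plain split routine. The 70/15/15 ratios and the seeded shuffle are assumptions for illustration; the patent does not specify the split proportions.

```python
import random

def split_dataset(units, train=0.7, val=0.15, seed=0):
    """Divide dataset units into training, verification, and test sets.
    Ratios and the fixed seed are illustrative assumptions."""
    units = list(units)
    random.Random(seed).shuffle(units)   # reproducible shuffle
    n = len(units)
    n_train = int(n * train)
    n_val = int(n * val)
    return (units[:n_train],
            units[n_train:n_train + n_val],
            units[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(100))
```

Training would then run on `train_set` to produce the initial detection model, with `val_set` and `test_set` used for verification and testing respectively to obtain the target detection model.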
3. The image processing method based on deep learning according to claim 2, wherein the constructing the data set specifically comprises:
collecting a large number of natural image samples to build a gallery;
generating, for each natural image sample, a mask image of equal resolution to build a mask library;
labeling the targets to be labeled in the non-mask area of each natural image sample to generate label information;
taking a natural image sample, its corresponding mask image, and the label information of its non-mask area as one unit, so as to construct a data set having a plurality of units.
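The dataset unit of claim 3 (image, equal-resolution mask, labels restricted to the non-mask area) can be sketched as below. The `(x, y, class)` annotation format and the convention that mask value 0 means "non-masked" are assumptions for illustration; the patent fixes neither.

```python
import numpy as np

def make_unit(image, mask, annotations):
    """One dataset unit: a natural image sample, its equal-resolution
    mask image, and label information for targets in the non-mask area.
    An annotation (x, y, class_name) is kept only where the mask is 0,
    i.e. only targets in the non-mask area are labeled."""
    assert image.shape[:2] == mask.shape, "mask must match image resolution"
    labels = [(x, y, c) for (x, y, c) in annotations if mask[y, x] == 0]
    return {"image": image, "mask": mask, "labels": labels}

img = np.zeros((8, 8, 3))
msk = np.zeros((8, 8))
msk[:, 4:] = 1                   # right half is the masked-out area
unit = make_unit(img, msk, [(1, 1, "cat"), (6, 2, "dog")])
```

Here the "dog" annotation falls in the masked right half and is dropped, so only the "cat" label in the non-mask area enters the unit.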
4. An image processing apparatus based on deep learning, based on the method according to any of claims 1-3, characterized in that the apparatus comprises:
The data acquisition unit is used for acquiring image data to be detected, and fusing the image data with the input prompt information to obtain input data;
The result generation unit is used for inputting the input data into a pre-constructed target detection model so as to obtain a target detection result output by the target detection model;
The target detection model is obtained by training a natural image sample, prompt information corresponding to the natural image sample, and label information based on a pre-constructed deep learning network.
5. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1 to 3.
6. A non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the method of any one of claims 1 to 3.
7. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411098789.9A CN118747797B (en) | 2024-08-12 | 2024-08-12 | Image processing method and device based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411098789.9A CN118747797B (en) | 2024-08-12 | 2024-08-12 | Image processing method and device based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118747797A CN118747797A (en) | 2024-10-08 |
CN118747797B true CN118747797B (en) | 2024-11-01 |
Family
ID=92921832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411098789.9A Active CN118747797B (en) | 2024-08-12 | 2024-08-12 | Image processing method and device based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118747797B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109063748A (en) * | 2018-07-16 | 2018-12-21 | 重庆大学 | Object detection method based on data enhancing |
CN116434002A (en) * | 2023-03-24 | 2023-07-14 | 国网河北省电力有限公司电力科学研究院 | Smoke detection method, system, medium and equipment based on lightweight neural network |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220410381A1 (en) * | 2021-06-29 | 2022-12-29 | Intrinsic Innovation Llc | Systems and methods for picking objects using 3-d geometry and segmentation |
CN117493674A (en) * | 2023-11-09 | 2024-02-02 | 齐鲁工业大学(山东省科学院) | Label enhancement-based supervision multi-mode hash retrieval method and system |
CN117710644A (en) * | 2023-11-21 | 2024-03-15 | 粤港澳大湾区数字经济研究院(福田) | Target detection method, device, equipment and storage medium based on visual prompt |
CN118247716B (en) * | 2024-05-29 | 2024-07-30 | 成都中扶蓉通科技有限公司 | Mouse target detection method, system and storage medium based on self-adaptive mask |
- 2024-08-12: application CN202411098789.9A granted as patent CN118747797B (status: active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109063748A (en) * | 2018-07-16 | 2018-12-21 | 重庆大学 | Object detection method based on data enhancing |
CN116434002A (en) * | 2023-03-24 | 2023-07-14 | 国网河北省电力有限公司电力科学研究院 | Smoke detection method, system, medium and equipment based on lightweight neural network |
Also Published As
Publication number | Publication date |
---|---|
CN118747797A (en) | 2024-10-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111160350B (en) | Portrait segmentation method, model training method, device, medium and electronic equipment | |
CN113011186B (en) | Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium | |
CN114298121B (en) | Multi-mode-based text generation method, model training method and device | |
CN113780486B (en) | Visual question answering method, device and medium | |
CN112188306B (en) | Label generation method, device, equipment and storage medium | |
CN113705733A (en) | Medical bill image processing method and device, electronic device and storage medium | |
CN114926835A (en) | Text generation method and device, and model training method and device | |
CN113989577B (en) | Image classification method and device | |
CN115761235A (en) | Zero sample semantic segmentation method, system, equipment and medium based on knowledge distillation | |
CN110287981A (en) | Conspicuousness detection method and system based on biological enlightening representative learning | |
CN117611845B (en) | Multi-mode data association identification method, device, equipment and storage medium | |
CN117235605B (en) | Sensitive information classification method and device based on multi-mode attention fusion | |
CN118747797B (en) | Image processing method and device based on deep learning | |
CN110852102B (en) | Chinese part-of-speech tagging method and device, storage medium and electronic equipment | |
CN117892140A (en) | Visual question and answer and model training method and device thereof, electronic equipment and storage medium | |
CN116580232A (en) | Automatic image labeling method and system and electronic equipment | |
CN113610080B (en) | Cross-modal perception-based sensitive image identification method, device, equipment and medium | |
CN112035670B (en) | Multi-modal rumor detection method based on image emotional tendency | |
CN115713621A (en) | Cross-modal image target detection method and device by using text information | |
CN115713082A (en) | Named entity identification method, device, equipment and storage medium | |
Joshi et al. | Optical Text Translator from Images using Machine Learning | |
CN112070060A (en) | Method for identifying age, and training method and device of age identification model | |
CN114022869B (en) | Vehicle heavy identification method and device based on cascade network | |
CN115270779B (en) | Method and system for generating ulcerative colitis structured report | |
CN110472728B (en) | Target information determining method, target information determining device, medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |