CN118747797B - Image processing method and device based on deep learning - Google Patents
- Publication number
- CN118747797B CN118747797B CN202411098789.9A CN202411098789A CN118747797B CN 118747797 B CN118747797 B CN 118747797B CN 202411098789 A CN202411098789 A CN 202411098789A CN 118747797 B CN118747797 B CN 118747797B
- Authority
- CN
- China
- Prior art keywords
- information
- image
- target detection
- natural image
- mask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Image Analysis (AREA)
Abstract
The invention provides an image processing method and device based on deep learning. The method comprises the following steps: acquiring image data to be detected and fusing it with input prompt information to obtain input data; and inputting the input data into a pre-constructed target detection model to obtain the target detection result output by that model. The target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the prompt information corresponding to the natural image samples, and label information. The method solves the technical problems in the prior art that target detection lacks interactive capability during image processing and that the image processing effect is therefore limited.
Description
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image processing method and apparatus based on deep learning.
Background
Target detection is one of the important means of image processing. In the medical field, for example, conventional target detection adopts either a two-stage detection model such as RCNN or a one-stage detection model such as YOLO. Two-stage models such as RCNN generate candidate regions in the first stage, then classify each candidate region and perform bounding-box regression in the second stage; one-stage models such as YOLO predict classes and boxes in a single stage. In either case, the prior art lacks interactive capability when detecting image targets: it cannot perform detection driven by a prompt, which limits both the effect and the application range of image processing.
Disclosure of Invention
The invention provides an image processing method and device based on deep learning, which are used to solve the technical problems in the prior art that target detection lacks interactive capability and the image processing effect is limited.
The invention provides an image processing method based on deep learning, which comprises the following steps:
acquiring image data to be detected, and fusing the image data with input prompt information to obtain input data;
inputting the input data into a pre-constructed target detection model to obtain a target detection result output by the target detection model;
The target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the prompt information corresponding to the natural image samples, and label information.
In some embodiments, the target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the prompt information corresponding to the natural image samples, and label information, which specifically includes:
constructing a data set, wherein the data set comprises a natural image sample, prompt information corresponding to the natural image sample and label information;
dividing the data set into a training set, a verification set and a test set;
Inputting samples in the training set into a pre-constructed deep learning network for training to obtain an initial detection model;
And respectively verifying and testing the initial detection model by using a verification set and a test set to obtain the target detection model.
In some embodiments, the constructing the data set specifically includes:
collecting a large number of natural image samples to establish a gallery;
generating a mask image of equal resolution from each natural image sample to establish a mask library;
labeling a target to be labeled in a non-mask area on a natural image sample to generate label information;
The natural image sample, its corresponding mask image, and the label information of its non-mask area are taken as one unit, to construct a data set composed of multiple such units.
In some embodiments, the network architecture of the deep learning network comprises:
The image encoder is used for extracting semantic information from an input natural image sample and obtaining a characteristic diagram of the natural image sample through multiple downsampling;
The prompt information encoder is used for extracting semantic information from the input prompt information, and obtaining a feature map fused with the prompt information through information fusion and processing;
The feature fusion module is used for carrying out feature fusion on the feature graphs with multiple sizes;
and the label matching module, used for determining the label information of each sample by means of a minimum-cost loss and for establishing the relation between image information and text description by means of contrastive learning.
In some embodiments, the hint information includes mask information and text information, and the hint information encoder includes:
The mask encoder is used for extracting semantic information from an input mask image and obtaining a feature map of the mask image through multiple downsampling;
The fusion layer is used for fusing the feature images of the natural image sample with the feature images of the mask image;
The text encoder is used for extracting high-level semantic information from the input text information, and for establishing the association between image features and text features through contrastive learning.
In some embodiments, the feature fusion module includes co-scale feature fusion and cross-scale feature fusion.
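The encoder-and-fusion pipeline of the embodiments above can be sketched structurally as below. This is a minimal illustration, not the patented implementation: the encoders are random stand-ins for the MAE-pretrained ViT and COCO-pretrained ResNet named later in the description, and the channel width `C = 256` and the fully connected weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 256  # channel width (illustrative assumption)

def encode_image(img, strides=(8, 16, 32)):
    """Stand-in for the image encoder: one feature map per downsampling level."""
    h, w, _ = img.shape
    return [rng.standard_normal((h // s, w // s, C)) for s in strides]

def encode_mask(mask, strides=(8, 16, 32)):
    """Stand-in for the mask encoder (same pyramid levels as the image encoder)."""
    h, w = mask.shape
    return [rng.standard_normal((h // s, w // s, C)) for s in strides]

def fuse(img_feats, mask_feats, W):
    """Fusion layer: concatenate image and mask features per level, then project
    back to C channels with a fully connected layer."""
    return [np.concatenate([i, m], axis=-1) @ W for i, m in zip(img_feats, mask_feats)]

img = rng.standard_normal((256, 256, 3))   # natural image sample
mask = np.zeros((256, 256))                # single-channel gray mask, equal resolution
W = rng.standard_normal((2 * C, C))        # FC weights: 2C -> C

fused = fuse(encode_image(img), encode_mask(mask), W)
print([f.shape for f in fused])            # [(32, 32, 256), (16, 16, 256), (8, 8, 256)]
```

The fused maps would then go to the feature fusion and label matching modules; those stages are sketched separately below only in spirit, since the patent gives their behavior rather than code.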
The invention also provides an image processing device based on deep learning, which comprises:
The data acquisition unit is used for acquiring image data to be detected, and fusing the image data with the input prompt information to obtain input data;
The result generation unit is used for inputting the input data into a pre-constructed target detection model so as to obtain a target detection result output by the target detection model;
The target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the prompt information corresponding to the natural image samples, and label information.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the program.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
According to the image processing method and device based on deep learning provided by the invention, the image data to be detected is acquired and fused with the input prompt information to obtain the input data; the input data is fed into a pre-constructed target detection model to obtain the target detection result it outputs; and the target detection model is trained on a pre-constructed deep learning network using natural image samples, the prompt information corresponding to those samples, and label information. By fusing prompt information into model training, the method and device can significantly improve the ability of the detection model to interact with human beings on the basis of prompts. This has practical value and solves the prior-art problems that target detection lacks interactive capability during image processing and that the image processing effect is limited.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is one of the flowcharts of the image processing method based on deep learning provided by the present invention;
FIG. 2 is a second flowchart of an image processing method based on deep learning according to the present invention;
FIG. 3 is a third flowchart of an image processing method based on deep learning according to the present invention;
fig. 4 is a block diagram of the image processing apparatus based on deep learning provided by the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
With the rapid development of deep learning networks, the performance of target detection models has greatly improved. However, as demand grows for interactive artificial intelligence systems, the target detection field has seen little research on prompt-driven visual understanding tasks. Recently, the success of large language models has demonstrated the importance of human interaction for modern artificial intelligence models; interactive segmentation models provide one example of research on visual understanding tasks; some target detection techniques avoid manually designed components such as anchor definition, label assignment, and prediction-box post-processing, completing the detection task end to end; and other multimodal models associate text features with image features. Accordingly, the invention provides an interactive target detection method based on deep learning: detection of the prompted target is realized through prompts such as points, boxes, texts, and masks, improving the interaction capability of the model and enhancing its practical value.
In a specific embodiment, the image processing method based on deep learning provided by the invention comprises the following steps:
S110: acquiring image data to be detected, and fusing the image data with input prompt information to obtain input data;
s120: inputting the input data into a pre-constructed target detection model to obtain a target detection result output by the target detection model;
The target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the prompt information corresponding to the natural image samples, and label information.
In some embodiments, as shown in fig. 2, training is performed based on a pre-constructed deep learning network by using a natural image sample, prompt information corresponding to the natural image sample, and tag information, so as to obtain the target detection model, and specifically includes the following steps:
S210: constructing a data set, wherein the data set comprises a natural image sample, prompt information corresponding to the natural image sample and label information;
S220: dividing the data set into a training set, a verification set and a test set;
S230: inputting samples in the training set into a pre-constructed deep learning network for training to obtain an initial detection model. Network model: based on the DINO framework, mask-encoding and text-encoding branches are introduced to improve the model's ability to interact through prompts such as points, boxes, texts, and masks;
s240: and respectively verifying and testing the initial detection model by using a verification set and a test set to obtain the target detection model.
In step S210, the constructing the data set specifically includes:
collecting a large number of natural image samples to establish a gallery;
generating a mask image of equal resolution from each natural image sample to establish a mask library;
labeling a target to be labeled in a non-mask area on a natural image sample to generate label information;
The natural image sample, its corresponding mask image, and the label information of its non-mask area are taken as one unit, to construct a data set composed of multiple such units.
Specifically, in the process of constructing the data set, a large number of natural images are first collected (including natural images contained in data sets such as COCO and VOC) to establish a gallery; for each natural image, a mask image of equal resolution, stored as a single-channel gray image, is generated to establish a mask library; the targets to be labeled in the non-mask area of each natural image are annotated (with category, bounding box, and a high-quality text description) to generate the label information; and each natural image, its corresponding mask image, and the label information of its non-mask area are taken as one basic unit, the units being divided into a training set, a verification set, and a test set at an 8:1:1 ratio.
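The unit construction and 8:1:1 division described above can be sketched as follows; `build_dataset` and `split_8_1_1` are hypothetical helper names, and the shuffle seed is an arbitrary choice.

```python
import random

def build_dataset(images, masks, labels):
    """One basic unit = (natural image, its equal-resolution mask image,
    label information of the non-mask area)."""
    return list(zip(images, masks, labels))

def split_8_1_1(units, seed=42):
    """Shuffle, then divide the units into training / verification / test sets
    at an 8:1:1 ratio."""
    units = units[:]
    random.Random(seed).shuffle(units)
    n_train = int(len(units) * 0.8)
    n_val = int(len(units) * 0.1)
    return units[:n_train], units[n_train:n_train + n_val], units[n_train + n_val:]

units = build_dataset(range(100), range(100), range(100))
train, val, test = split_8_1_1(units)
print(len(train), len(val), len(test))  # 80 10 10
```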
In some embodiments, the network architecture of the deep learning network comprises:
The image encoder is used for extracting semantic information from the input natural image sample and obtaining feature maps of the sample through repeated downsampling. Specifically, an MAE-pretrained Vision Transformer (ViT) is used as the image encoder to extract the high-level semantic information of the input natural image, yielding feature maps at 8×, 16×, and 32× downsampling of the input. The feature maps of the different levels can be written abstractly as $\{I_i\}_{i=1}^{L}$, where $L$ is the number of feature-map levels (here $L = 3$);
The prompt information encoder is used for extracting semantic information from the input prompt information, and obtaining a feature map fused with the prompt information through information fusion and processing;
The feature fusion module is used for fusing feature maps of multiple sizes, and comprises co-scale feature fusion and cross-scale feature fusion. It aims to improve the model's ability to detect targets of different sizes through multi-size feature fusion. Specifically, to reduce the computation of the model, and considering that low-level feature maps lack semantic information and risk being confounded with the semantic information of high-level feature maps, the feature fusion module is decoupled into co-scale feature fusion and cross-scale feature fusion. Co-scale feature fusion can be expressed as:

$$F_i' = \mathrm{Reshape}\left(\mathrm{MHSA}\left(Q_i, K_i, V_i\right)\right), \qquad Q_i = K_i = V_i = \mathrm{Flatten}(F_i)$$

where $F_i$ denotes the $i$-th level feature map after the fusion of natural-image and mask-image features; $Q_i$, $K_i$, and $V_i$ denote the query, key, and value corresponding to that feature map; $\mathrm{Flatten}$ denotes the flattening operation; $\mathrm{MHSA}$ denotes multi-head self-attention; and $\mathrm{Reshape}$ restores the dimensions prior to the $\mathrm{Flatten}$ operation;
cross-scale feature fusion adopts a feature pyramid structure; the top-down and bottom-up fusion paths enrich the position information of the high-level feature maps and the semantic information of the low-level feature maps, improving the model's ability to detect targets of different sizes.
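The Flatten, self-attend, Reshape pattern of co-scale fusion can be illustrated on one pyramid level as below; a single attention head stands in for the multi-head self-attention, and the feature sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_scale_fusion(F):
    """Co-scale fusion on one level: Flatten the (H, W, C) map, self-attend
    with Q = K = V (one head stands in for MHSA here), then Reshape back to
    the dimensions before the Flatten operation."""
    h, w, c = F.shape
    x = F.reshape(h * w, c)               # Flatten
    attn = softmax(x @ x.T / np.sqrt(c))  # attention weights over all positions
    return (attn @ x).reshape(h, w, c)    # attend, then Reshape

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 8, 16))       # one fused image+mask feature map
out = co_scale_fusion(F)
print(out.shape)  # (8, 8, 16)
```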
The label matching module is used for determining the label information of each sample by means of a minimum-cost loss, establishing the classification and regression losses, and so realizing target detection. In addition, to further improve the interactive capability of the model, contrastive learning is used to establish a relationship between image information and text descriptions, which can be expressed as:

$$\mathcal{L}_v = -\frac{1}{|\mathcal{I}|}\sum_{i \in \mathcal{I}} \frac{1}{|\mathcal{T}_i|}\sum_{t \in \mathcal{T}_i} \log \frac{\exp\left(\mathrm{sim}(i, t)\right)}{\sum_{t' \in \mathcal{T}} \exp\left(\mathrm{sim}(i, t')\right)}, \qquad \mathcal{L}_t = -\frac{1}{|\mathcal{T}|}\sum_{t \in \mathcal{T}} \frac{1}{|\mathcal{I}_t|}\sum_{i \in \mathcal{I}_t} \log \frac{\exp\left(\mathrm{sim}(i, t)\right)}{\sum_{i' \in \mathcal{I}} \exp\left(\mathrm{sim}(i', t)\right)}, \qquad \mathcal{L} = \frac{\mathcal{L}_v + \mathcal{L}_t}{2}$$

where $\mathcal{L}_v$ and $\mathcal{L}_t$ denote the losses from the vision and text perspectives respectively, and $\mathcal{L}$ the total loss; $\mathcal{T}$ denotes the set of text features within a batch and $\mathcal{I}$ the set of image features within a batch; $\mathcal{T}_i$ and $\mathcal{I}_t$ are subsets of $\mathcal{T}$ and $\mathcal{I}$: $\mathcal{T}_i$ is the set of text features whose category label matches that of image feature $i$, and $\mathcal{I}_t$ the set of image features whose category label matches that of text feature $t$; and $\mathrm{sim}(i, t)$ denotes the cosine similarity of an image feature and a text feature.
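A sketch of the label-aware contrastive objective described above, assuming the standard softmax-over-similarities form (temperature scaling omitted for brevity); the feature values and category labels are synthetic.

```python
import numpy as np

def contrastive_loss(img_feats, txt_feats, labels):
    """Symmetric contrastive loss over a batch: for image feature i, the
    positives are the text features sharing its category label (and vice
    versa); the total loss averages the vision-side and text-side terms."""
    I = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    T = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    sim = I @ T.T                             # cosine similarities
    pos = labels[:, None] == labels[None, :]  # same-category (image, text) pairs
    logp_v = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))  # image -> texts
    logp_t = sim - np.log(np.exp(sim).sum(axis=0, keepdims=True))  # text -> images
    return (-logp_v[pos].mean() - logp_t[pos].mean()) / 2

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1])
loss = contrastive_loss(rng.standard_normal((4, 8)), rng.standard_normal((4, 8)), labels)
print(loss)
```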
In some embodiments, the hint information includes mask information and text information, and the hint information encoder includes:
The mask encoder is used for extracting semantic information from the input mask image and obtaining feature maps of the mask image through repeated downsampling. Specifically, a COCO-pretrained ResNet is used as the mask encoder to extract the high-level semantic information of the input mask image, yielding feature maps at 8×, 16×, and 32× downsampling of the input. The feature maps of the different levels can be written abstractly as $\{M_i\}_{i=1}^{L}$, with $L$ the number of levels as above. The fusion of the natural-image features and the mask-image features can be expressed as:

$$F_i = \mathrm{FC}\left(\mathrm{Concat}\left(I_i, M_i\right)\right)$$

where $\mathrm{FC}$ denotes a fully connected layer and $F_i$ denotes the $i$-th level feature map after fusion, with $i = 1, \ldots, L$;
The fusion layer is used for fusing the feature images of the natural image sample with the feature images of the mask image;
The text encoder is used for extracting high-level semantic information from the input text information, and for establishing the association between image features and text features through contrastive learning.
The text encoder adopts a CLIP-pretrained text encoder to extract the high-level semantic information of the input text, aiming to establish, through contrastive learning, the connection between image features and text features and so improve the model's ability to locate a target from a text prompt. Specifically, a batch of natural images $\{x_n\}_{n=1}^{N}$ is first randomly sampled together with the corresponding mask images $\{m_n\}_{n=1}^{N}$ and the text descriptions $\{t_n^{j}\}$ in the label information, where $t_n^{j}$ denotes the text description of the $j$-th object in the $n$-th natural image and $N$ denotes the batch size. The natural image $x_n$, mask image $m_n$, and text description $t_n^{j}$ are then fed to the image encoder $E_{\mathrm{img}}$, mask encoder $E_{\mathrm{mask}}$, and text encoder $E_{\mathrm{text}}$ respectively, and the image, mask, and text features are obtained as:

$$I_n^{i} = E_{\mathrm{img}}(x_n), \qquad M_n^{i} = E_{\mathrm{mask}}(m_n), \qquad T_n^{j} = E_{\mathrm{text}}(t_n^{j})$$

where $I_n^{i}$ and $M_n^{i}$ denote the $i$-th level feature maps corresponding to natural image $x_n$ and mask image $m_n$, whose channel number is $C$, and $T_n^{j}$ denotes the text feature of the description of the $j$-th object in the $n$-th natural image, of dimension $C$.

The fusion of the natural-image features and the mask-image features can be expressed as:

$$F_n^{i} = \mathrm{FC}\left(\mathrm{Concat}\left(I_n^{i}, M_n^{i}\right)\right)$$

where $F_n^{i}$ denotes the $i$-th level feature map after the fusion of natural-image and mask-image features, whose channel number is likewise $C$.
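The shape bookkeeping of this fusion can be checked with a small numerical sketch; the batch size, spatial size, and channel number `C` are illustrative, and the features are random stand-ins for the encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 4, 32                                 # batch size and channel number (illustrative)

# Random stand-ins for one pyramid level of encoder outputs:
I_feats = rng.standard_normal((N, 8, 8, C))  # image features I_n
M_feats = rng.standard_normal((N, 8, 8, C))  # mask features M_n
T_feats = rng.standard_normal((N, C))        # one text feature T_n per object

# F_n = FC(Concat(I_n, M_n)): the fully connected layer maps 2C channels back to C,
# so the fused map matches the text-feature dimension C for contrastive learning.
W = rng.standard_normal((2 * C, C))
F_feats = np.concatenate([I_feats, M_feats], axis=-1) @ W
print(F_feats.shape)  # (4, 8, 8, 32)
```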
During model optimization, a deep learning workstation was established running Ubuntu 20.04, with hardware comprising an Intel Core i9 processor, 64 GB of memory, and one NVIDIA GeForce RTX 3090. The configuration file of the model is based on the DINO structure: a contrastive denoising module is retained, which, through the matching of positive and negative queries, avoids the unstable label matching of the Hungarian algorithm early in training; and a cross-attention module based on multi-scale deformable convolution is introduced into the decoder, improving the model's ability to acquire relevant content information from position information.
In an actual usage scenario, as shown in fig. 3, the DINO structure facilitates the introduction of prompts such as points, boxes, texts, and masks, realizing detection of the prompted target. Specifically, at test time the query positions provide an interface for initialization with physical size information such as points and boxes, and the image features of interest are obtained through the cross-attention module on the basis of that position information; the mask encoder provides the objective conditions for introducing mask information, and the supervised learning of the training stage teaches the model to automatically shield the image features of the mask region, so that at test time the queries obtain only the image features of interest in the non-mask region; and the text encoder relies on contrastive learning to establish the link between text features and image features.
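The mask-prompt behavior at test time amounts to suppressing the mask region's contribution. A crude functional stand-in is shown below; the real model learns this suppression through supervised training rather than hard zeroing, so `apply_mask_prompt` is only an assumed approximation of the effect.

```python
import numpy as np

def apply_mask_prompt(feats, mask):
    """Zero the image features wherever the (already downsampled) mask is
    active, so queries can only pick up features from the non-mask region."""
    return feats * (1.0 - mask)[..., None]

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 8, 16))   # image feature map at one level
mask = np.zeros((8, 8))
mask[:4, :] = 1.0                         # mask out the top half of the image

out = apply_mask_prompt(feats, mask)
print(np.abs(out[:4]).sum() == 0.0)       # True: masked region contributes nothing
```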
In the above specific embodiment, according to the image processing method based on deep learning provided by the invention, the image data to be detected is acquired and fused with the input prompt information to obtain the input data; the input data is fed into a pre-constructed target detection model to obtain the target detection result it outputs; and the target detection model is trained on a pre-constructed deep learning network using natural image samples, the prompt information corresponding to those samples, and label information. By fusing prompt information into model training, the method can significantly improve the ability of the detection model to interact with human beings on the basis of prompts. This has practical value and solves the prior-art problems that target detection lacks interactive capability during image processing and that the image processing effect is limited.
In addition to the above method, the present invention also provides an image processing apparatus based on deep learning, as shown in fig. 4, the apparatus comprising:
the data acquisition unit 410 is configured to acquire image data to be detected, and fuse the image data with the input prompt information to obtain input data;
A result generating unit 420, configured to input the input data into a pre-constructed target detection model, so as to obtain a target detection result output by the target detection model;
The target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the prompt information corresponding to the natural image samples, and label information.
In some embodiments, the target detection model is obtained by training a pre-constructed deep learning network with natural image samples, the mask image samples corresponding to the natural image samples, and label information, which specifically includes:
constructing a data set, wherein the data set comprises a natural image sample, prompt information corresponding to the natural image sample and label information;
dividing the data set into a training set, a verification set and a test set;
Inputting samples in the training set into a pre-constructed deep learning network for training to obtain an initial detection model;
And respectively verifying and testing the initial detection model by using a verification set and a test set to obtain the target detection model.
In some embodiments, the constructing the data set specifically includes:
collecting a large number of natural image samples to establish a gallery;
generating a mask image of equal resolution from each natural image sample to establish a mask library;
labeling a target to be labeled in a non-mask area on a natural image sample to generate label information;
The natural image sample, its corresponding mask image, and the label information of its non-mask area are taken as one unit, to construct a data set composed of multiple such units.
In some embodiments, the network architecture of the deep learning network comprises:
The image encoder is used for extracting semantic information from an input natural image sample and obtaining a characteristic diagram of the natural image sample through multiple downsampling;
The prompt information encoder is used for extracting semantic information from the input prompt information, and obtaining a feature map fused with the prompt information through information fusion and processing;
The feature fusion module is used for carrying out feature fusion on the feature graphs with multiple sizes;
and the label matching module, used for determining the label information of each sample by means of a minimum-cost loss and for establishing the relation between image information and text description by means of contrastive learning.
In some embodiments, the hint information includes mask information and text information, and the hint information encoder includes:
The mask encoder is used for extracting semantic information from an input mask image and obtaining a feature map of the mask image through multiple downsampling;
The fusion layer is used for fusing the feature images of the natural image sample with the feature images of the mask image;
The text encoder is used for extracting high-level semantic information from the input text information, and for establishing the association between image features and text features through contrastive learning.
In some embodiments, the feature fusion module includes co-scale feature fusion and cross-scale feature fusion.
In the above specific embodiment, the image processing device based on deep learning provided by the invention acquires the image data to be detected and fuses it with the input prompt information to obtain the input data; feeds the input data into a pre-constructed target detection model to obtain the target detection result it outputs; and uses a target detection model trained on a pre-constructed deep learning network with natural image samples, the prompt information corresponding to those samples, and label information. By fusing prompt information into model training, the device can significantly improve the ability of the detection model to interact with human beings on the basis of prompts. This has practical value and solves the prior-art problems that target detection lacks interactive capability during image processing and that the image processing effect is limited.
Fig. 5 illustrates a physical schematic diagram of an electronic device, as shown in fig. 5, which may include: processor 510, communication interface (Communications Interface) 520, memory 530, and communication bus 540, wherein processor 510, communication interface 520, memory 530 complete communication with each other through communication bus 540. Processor 510 may invoke logic instructions in memory 530 to perform the methods described above.
Further, the logic instructions in the memory 530 may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium and that, when executed by a processor, performs the methods described above.
The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solution, or the part of it that contributes over the prior art, may in essence be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in each embodiment or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (7)
1. An image processing method based on deep learning, the method comprising:
acquiring image data to be detected, and fusing the image data with input prompt information to obtain input data;
inputting the input data into a pre-constructed target detection model to obtain a target detection result output by the target detection model;
the target detection model is obtained by training a natural image sample, prompt information corresponding to the natural image sample and label information based on a pre-constructed deep learning network;
the network architecture of the deep learning network comprises:
an image encoder for extracting semantic information from an input natural image sample and obtaining feature maps of the natural image sample through multiple rounds of downsampling;
a prompt information encoder for extracting semantic information from input prompt information and obtaining a feature map fused with the prompt information through information fusion and processing;
a feature fusion module for performing feature fusion on feature maps of multiple sizes; and
a label matching module for determining label information of a sample using a minimum-cost loss and establishing a relation between image information and text description using contrastive learning;
wherein the prompt information includes mask information and text information, and the prompt information encoder includes:
a mask encoder for extracting semantic information from an input mask image and obtaining feature maps of the mask image through multiple rounds of downsampling;
a fusion layer for fusing the feature maps of the natural image sample with the feature maps of the mask image; and
a text encoder for extracting high-level semantic information from the input text information and establishing an association between image features and text features through contrastive learning;
and wherein the feature fusion module performs both co-scale feature fusion and cross-scale feature fusion.
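The encoder and fusion elements of claim 1 can be sketched as follows. This is a minimal NumPy illustration, not the patented network: downsampling is assumed to be 2x average pooling, and co-scale fusion is assumed to be elementwise addition of same-size feature maps; the claim specifies neither, and every function name here is hypothetical.

```python
import numpy as np

def downsample2x(feat):
    """One downsampling step of the image/mask encoder, assumed here
    to be 2x average pooling over non-overlapping blocks."""
    h, w = feat.shape[0] // 2 * 2, feat.shape[1] // 2 * 2
    f = feat[:h, :w]
    return (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2]) / 4.0

def encode(image, steps=3):
    """Encoder: repeated downsampling yields feature maps of multiple
    sizes (a pyramid), as both the image and mask encoders produce."""
    pyramid = [image]
    for _ in range(steps):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid

def fuse(img_pyramid, mask_pyramid):
    """Fusion layer: combine natural-image and mask-image feature maps
    at each matching scale (co-scale fusion, assumed elementwise add)."""
    return [i + m for i, m in zip(img_pyramid, mask_pyramid)]

img = np.random.rand(32, 32)
msk = np.zeros((32, 32))        # all-zero mask prompt for illustration
fused = fuse(encode(img), encode(msk))
```

With a 32x32 input and three downsampling steps, the fused pyramid holds feature maps at sizes 32, 16, 8, and 4; an all-zero mask contributes nothing, so the fused maps equal the image maps.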
2. The deep-learning-based image processing method according to claim 1, wherein training based on the pre-constructed deep learning network using the natural image sample, the prompt information corresponding to the natural image sample, and the label information to obtain the target detection model specifically comprises:
constructing a data set, wherein the data set comprises a natural image sample, prompt information corresponding to the natural image sample and label information;
dividing the data set into a training set, a verification set and a test set;
Inputting samples in the training set into a pre-constructed deep learning network for training to obtain an initial detection model;
And respectively verifying and testing the initial detection model by using a verification set and a test set to obtain the target detection model.
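The training pipeline of claim 2 (build a dataset, split it into training, verification, and test sets, then train and evaluate) can be sketched as a plain split routine. The 70/15/15 ratios and the seeded shuffle are assumptions for illustration; the patent does not specify the split proportions.

```python
import random

def split_dataset(units, train=0.7, val=0.15, seed=0):
    """Divide dataset units into training, verification, and test sets.
    Ratios and the fixed seed are illustrative assumptions."""
    units = list(units)
    random.Random(seed).shuffle(units)   # reproducible shuffle
    n = len(units)
    n_train = int(n * train)
    n_val = int(n * val)
    return (units[:n_train],
            units[n_train:n_train + n_val],
            units[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(100))
```

Training would then run on `train_set` to produce the initial detection model, with `val_set` and `test_set` used for verification and testing respectively to obtain the target detection model.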
3. The image processing method based on deep learning according to claim 2, wherein the constructing the data set specifically comprises:
collecting a large number of natural image samples to build a gallery;
generating, for each natural image sample, a mask image of equal resolution to build a mask library;
labeling the targets to be labeled in the non-mask area of each natural image sample to generate label information;
taking a natural image sample, its corresponding mask image, and the label information of its non-mask area as one unit, so as to construct a data set having a plurality of units.
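The dataset unit of claim 3 (image, equal-resolution mask, labels restricted to the non-mask area) can be sketched as below. The `(x, y, class)` annotation format and the convention that mask value 0 means "non-masked" are assumptions for illustration; the patent fixes neither.

```python
import numpy as np

def make_unit(image, mask, annotations):
    """One dataset unit: a natural image sample, its equal-resolution
    mask image, and label information for targets in the non-mask area.
    An annotation (x, y, class_name) is kept only where the mask is 0,
    i.e. only targets in the non-mask area are labeled."""
    assert image.shape[:2] == mask.shape, "mask must match image resolution"
    labels = [(x, y, c) for (x, y, c) in annotations if mask[y, x] == 0]
    return {"image": image, "mask": mask, "labels": labels}

img = np.zeros((8, 8, 3))
msk = np.zeros((8, 8))
msk[:, 4:] = 1                   # right half is the masked-out area
unit = make_unit(img, msk, [(1, 1, "cat"), (6, 2, "dog")])
```

Here the "dog" annotation falls in the masked right half and is dropped, so only the "cat" label in the non-mask area enters the unit.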
4. An image processing apparatus based on deep learning, based on the method according to any of claims 1-3, characterized in that the apparatus comprises:
The data acquisition unit is used for acquiring image data to be detected, and fusing the image data with the input prompt information to obtain input data;
The result generation unit is used for inputting the input data into a pre-constructed target detection model so as to obtain a target detection result output by the target detection model;
The target detection model is obtained by training a natural image sample, prompt information corresponding to the natural image sample, and label information based on a pre-constructed deep learning network.
5. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1 to 3.
6. A non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the method of any one of claims 1 to 3.
7. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411098789.9A CN118747797B (en) | 2024-08-12 | 2024-08-12 | Image processing method and device based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411098789.9A CN118747797B (en) | 2024-08-12 | 2024-08-12 | Image processing method and device based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118747797A CN118747797A (en) | 2024-10-08 |
CN118747797B true CN118747797B (en) | 2024-11-01 |
Family
ID=92921832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411098789.9A Active CN118747797B (en) | 2024-08-12 | 2024-08-12 | Image processing method and device based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118747797B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109063748A (en) * | 2018-07-16 | 2018-12-21 | 重庆大学 | Object detection method based on data enhancing |
CN116434002A (en) * | 2023-03-24 | 2023-07-14 | 国网河北省电力有限公司电力科学研究院 | Smoke detection method, system, medium and equipment based on lightweight neural network |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220410381A1 (en) * | 2021-06-29 | 2022-12-29 | Intrinsic Innovation Llc | Systems and methods for picking objects using 3-d geometry and segmentation |
CN117493674A (en) * | 2023-11-09 | 2024-02-02 | 齐鲁工业大学(山东省科学院) | Label enhancement-based supervision multi-mode hash retrieval method and system |
CN117710644A (en) * | 2023-11-21 | 2024-03-15 | 粤港澳大湾区数字经济研究院(福田) | Target detection method, device, equipment and storage medium based on visual prompt |
CN118247716B (en) * | 2024-05-29 | 2024-07-30 | 成都中扶蓉通科技有限公司 | Mouse target detection method, system and storage medium based on self-adaptive mask |
- 2024-08-12: application CN202411098789.9A granted as patent CN118747797B (status: active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109063748A (en) * | 2018-07-16 | 2018-12-21 | 重庆大学 | Object detection method based on data enhancing |
CN116434002A (en) * | 2023-03-24 | 2023-07-14 | 国网河北省电力有限公司电力科学研究院 | Smoke detection method, system, medium and equipment based on lightweight neural network |
Also Published As
Publication number | Publication date |
---|---|
CN118747797A (en) | 2024-10-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111160350B (en) | Portrait segmentation method, model training method, device, medium and electronic equipment | |
CN113011186B (en) | Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium | |
CN114298121B (en) | Multi-mode-based text generation method, model training method and device | |
CN113780486B (en) | Visual question answering method, device and medium | |
CN112188306B (en) | Label generation method, device, equipment and storage medium | |
CN113705733A (en) | Medical bill image processing method and device, electronic device and storage medium | |
CN114926835A (en) | Text generation method and device, and model training method and device | |
CN113989577B (en) | Image classification method and device | |
CN115761235A (en) | Zero sample semantic segmentation method, system, equipment and medium based on knowledge distillation | |
CN110287981A (en) | Conspicuousness detection method and system based on biological enlightening representative learning | |
CN117611845B (en) | Multi-mode data association identification method, device, equipment and storage medium | |
CN117235605B (en) | Sensitive information classification method and device based on multi-mode attention fusion | |
CN118747797B (en) | Image processing method and device based on deep learning | |
CN110852102B (en) | Chinese part-of-speech tagging method and device, storage medium and electronic equipment | |
CN117892140A (en) | Visual question and answer and model training method and device thereof, electronic equipment and storage medium | |
CN116580232A (en) | Automatic image labeling method and system and electronic equipment | |
CN113610080B (en) | Cross-modal perception-based sensitive image identification method, device, equipment and medium | |
CN112035670B (en) | Multi-modal rumor detection method based on image emotional tendency | |
CN115713621A (en) | Cross-modal image target detection method and device by using text information | |
CN115713082A (en) | Named entity identification method, device, equipment and storage medium | |
Joshi et al. | Optical Text Translator from Images using Machine Learning | |
CN112070060A (en) | Method for identifying age, and training method and device of age identification model | |
CN114022869B (en) | Vehicle heavy identification method and device based on cascade network | |
CN115270779B (en) | Method and system for generating ulcerative colitis structured report | |
CN110472728B (en) | Target information determining method, target information determining device, medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |