CN114332456A

CN114332456A - Target detection and identification method and device for large-resolution image

Info

Publication number: CN114332456A
Application number: CN202210255384.6A
Authority: CN
Inventors: 张凯; 马乐乐; 崔超然; 逯天斌
Original assignee: Shandong Liju Robot Technology Co ltd
Current assignee: Shandong Liju Robot Technology Co ltd
Priority date: 2022-03-16
Filing date: 2022-03-16
Publication date: 2022-04-12

Abstract

The invention relates to the technical field of image recognition, in particular to a method and a device for detecting and recognizing a target of a high-resolution image, wherein the method comprises the following steps: acquiring a large-resolution image set, and performing data enhancement to obtain an enhanced image set; dividing each original image in the enhanced image set to obtain corresponding sub-images and position information thereof; coding and fusing the subimages and the position information thereof to obtain corresponding data tensors; performing feature representation learning on the data tensor layer by layer based on a Faster R-CNN model, fusing low-layer information, middle-layer information and high-layer information of the Faster R-CNN model by adopting an attention mechanism, determining feature representations corresponding to subimages, further determining candidate target positions, and performing regression and classification to determine a final target position and a category to which the final target position belongs; and determining the final target position and the category of the original image according to the final target position and the category of the original image. Through the scheme, the final model performance is improved.

Description

Target detection and identification method and device for large-resolution image

Technical Field

The invention relates to the technical field of image recognition, in particular to a method and a device for detecting and recognizing a target of a high-resolution image.

Background

With the rapid development of information technology, the convenience, high efficiency, safety and reliability brought by information processing make industrial informatization become the development trend of various industries. The image is a medium which is most ubiquitous in daily life and plays a key role in the information transfer process. Therefore, it is an important research content in the computer vision direction to efficiently and reliably utilize image information, and the image information attracts a great number of researchers.

In the early stage, because the semantic information of the image is complex, the traditional machine learning algorithm cannot fully understand the image information, and therefore the research is relatively simple. In recent years, the advent of deep learning, the improvement of computing performance, and the arrival of the era of big data brought about the fact that it is necessary to fully understand the use of image information, and many research subjects of fire and heat, such as image classification, image segmentation, object detection, face recognition, and re-recognition, have been generated around this, and have been largely successful in these directions.

It is noted, however, that although image studies have been successful in all directions with the continuous updating of learning algorithms, many study directions still face significant challenges in some special contexts. Including the problem of large resolution image object detection. The image is generally taken by professional equipment and has its fixed purpose, as distinguished from ordinary life pictures, such as satellite images or other aerial images. Such images often have their fixed role, e.g. for observing terrain, vegetation, water conservancy, or for military reconnaissance, re-orThe device is used for meteorological monitoring and the like. Taking a terrestrial satellite image as an example, the image needs to be detected when being used for observing the terrain, vegetation or water conservancy, and if the target is small, the common image target detection method does not work because the resolution of the image is too large, and firstly, the input of the common method is generally positioned at 10²*10²- 10³*10³The size of the large resolution image is usually much larger than the size of the large resolution image, and if the size of the original data is simply scaled, a large amount of information is lost, and especially when the detected target is small, the target may even be lost; secondly, the size of the image with large resolution is larger, and the amount of rich information is larger, so that the proportion of a target area to a background area is smaller; thirdly, due to the special application background, the quantity of the images is small, a large amount of experimental data cannot be obtained, and the training of the model is not facilitated. Due to the factors, the original method cannot obtain good precision in the aspect of large-resolution image target detection, so that the normal performance requirement cannot be met.

Disclosure of Invention

In order to overcome the problems in the related art, the invention provides a method and a device for detecting and identifying a target of a high-resolution image.

According to a first aspect of the embodiments of the present invention, there is provided a method for detecting and identifying a target in a high-resolution image, the method including:

acquiring a large-resolution image set, and performing data enhancement on the large-resolution image set to obtain an enhanced image set;

dividing each original image in the enhanced image set to obtain corresponding sub-images and position information thereof;

coding and fusing the subimages and the position information thereof to obtain corresponding data tensors;

performing feature representation learning layer by layer on the data tensor based on a Faster R-CNN model, and fusing low-layer information, middle-layer information and high-layer information of the Faster R-CNN model by adopting an attention mechanism to determine feature representation corresponding to the subimages;

determining candidate target positions according to the feature representations corresponding to the sub-images, and performing regression and classification to determine the final target position of each sub-image and the category of the sub-image;

and determining the final target position and the category of the original image according to the final target position and the category of each sub-image.

In one embodiment, preferably, segmenting each image in the enhanced image set to obtain a corresponding sub-image and position information thereof includes:

dividing each original image by adopting a fixed window overlapping type dividing mode to obtain corresponding sub-images, and arranging the sub-images according to the sequence;

and performing data preprocessing on each sub-image, and determining the position information of the sub-image in the original image, wherein the position information comprises the coordinates of the center point of the sub-image in the original image and the width and height of the sub-image.

In one embodiment, preferably, the fusing the low-level information, the middle-level information and the high-level information of the Faster R-CNN model by using an attention mechanism to determine the corresponding feature representation of the sub-image includes:

respectively taking the low-layer information, the middle-layer information and the high-layer information of the Faster R-CNN model as Q, K and V in an attention mechanism, and calculating by adopting the following formula to determine the characteristic representation corresponding to the sub-image;

wherein Z represents the feature representation corresponding to the sub-image, Q represents a query item, K represents a key value, V represents a parameter value, and d represents a super parameter.

In one embodiment, preferably, determining the final target position of the original image and the category to which the final target position belongs according to the final target position of each sub-image and the category to which the final target position belongs includes:

and merging the final target positions of which the distances between the sub-images are smaller than a preset threshold value according to the position information of the sub-images in the original image to determine the final target position of the original image, and determining the category with the maximum category probability value as the category to which the original image belongs.

In one embodiment, the candidate target locations are preferably determined using the following first calculation:

wherein,

indicating a loss of the candidate target location,

which represents the parameters to be learned and,

is shown asiA vector representation of the position of the individual candidate objects,

is shown asiThe offset at which the candidate target location changes to the true target location,

which represents the amount of the offset of the prediction,

representing the regularization term, γ is a hyper-parameter;

regression and classification are performed using the following second calculation formula:

l ₂= L _cls + λL _loc (2)

wherein,l ₂the sum of the losses of the regression and classification is represented,L _clsrepresenting classificationsThe loss of the carbon dioxide gas is reduced,L _locthe loss of position is indicated and,λis a super parameter, and balances the two loss parts.

According to a second aspect of embodiments of the present invention, there is provided an object detection and recognition apparatus for a large-resolution image, the apparatus including:

the enhancement module is used for acquiring a large-resolution image set and enhancing data of the large-resolution image set to obtain an enhanced image set;

the segmentation module is used for segmenting each original image in the enhanced image set to obtain corresponding sub-images and position information thereof;

the processing module is used for coding and fusing the sub-images and the position information thereof to obtain corresponding data tensors;

the fusion module is used for performing feature representation learning layer by layer on the data tensor based on a Faster R-CNN model, and fusing low-layer information, middle-layer information and high-layer information of the Faster R-CNN model by adopting an attention mechanism so as to determine feature representation corresponding to the subimages;

the first determining module is used for determining candidate target positions according to the feature representations corresponding to the sub-images, and performing regression and classification to determine the final target position of each sub-image and the category of the sub-image;

and the second determining module is used for determining the final target position and the category of the original image according to the final target position and the category of each sub-image.

In one embodiment, preferably, the segmentation module includes:

the segmentation unit is used for segmenting each original image by adopting a fixed window overlapping segmentation mode to obtain corresponding sub-images and arranging the sub-images in sequence;

and the preprocessing unit is used for preprocessing data of each sub-image and determining the position information of the sub-image in the original image, wherein the position information comprises the coordinates of the center point of the sub-image in the original image and the width and height of the sub-image.

In one embodiment, preferably, the fusion module is configured to:

In one embodiment, preferably, the second determining module is configured to:

wherein,

indicating a loss of the candidate target location,

which represents the parameters to be learned and,

which represents the amount of the offset of the prediction,

representing the regularization term, γ is a hyper-parameter;

l ₂= L _cls + λL _loc (2)

wherein,l ₂the sum of the losses of the regression and classification is represented,L _clsa loss of classification is indicated and,L _locthe loss of position is indicated and,λis a super parameter, and balances the two loss parts.

According to a third aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method of any one of the first aspect.

The technical scheme provided by the embodiment of the invention can have the following beneficial effects:

the invention realizes target detection on a high-resolution image based on high-low layer information fusion, and compared with the previous method. Since the large-resolution image size is much larger than the normal image, the large-resolution image needs to be preprocessed in order to make the model feasible. If the mode of equal scaling is adopted, target information is likely to be lost, and in order to solve the problem, the invention adopts the window overlapping type to segment the image to obtain the sub-images of the original image, and the sub-images are sequentially used as the input of the model, so that the integrity of the information is effectively ensured; meanwhile, in order to avoid losing the position information of the sub-image in the original image, the invention additionally adds the position information on the characteristic representation to enhance the integrity of the spatial information of the sub-image; in addition, during feature learning, the method utilizes an attention mechanism to perform weighted fusion on high-layer information and low-layer information to obtain corresponding convolutional layer output for downstream tasks, and feature information is greatly enriched through the method, so that the final model performance is improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow diagram illustrating a method for object detection and recognition of a high resolution image according to an exemplary embodiment.

Fig. 2 is a flowchart illustrating a step S102 in a target detecting and recognizing method of a large-resolution image according to an exemplary embodiment.

FIG. 3 is a schematic diagram illustrating a method of object detection and recognition for a high resolution image according to an exemplary embodiment.

FIG. 4 is a block diagram illustrating an apparatus for object detection and recognition of a large resolution image according to an exemplary embodiment.

FIG. 5 is a block diagram illustrating an apparatus for object detection and recognition of a large resolution image according to an exemplary embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

FIG. 1 is a flow diagram illustrating a method for object detection and recognition of a high resolution image, as shown in FIG. 1, according to an exemplary embodiment, the method comprising:

step S101, acquiring a large-resolution image set, and performing data enhancement on the large-resolution image set to obtain an enhanced image set;

due to the special background of the large-resolution picture, the data volume is far smaller than that of a common picture task, so that the problem of model training insufficiency caused by the insufficiency of the data volume is easily caused.

Step S102, segmenting each original image in the enhanced image set to obtain corresponding sub-images and position information thereof;

because the size of the original image is too large, the whole image cannot be input into the model at one time, and the information is easily lost by the traditional downsampling method.

Step S103, encoding and fusing the sub-images and the position information thereof to obtain corresponding data tensors;

step S104, based on a Faster R-CNN model, performing feature representation learning layer by layer on the data tensor, and fusing low-layer information, middle-layer information and high-layer information of the Faster R-CNN model by adopting an attention mechanism to determine feature representation corresponding to the subimages;

step S105, determining candidate target positions according to the feature representations corresponding to the sub-images, and performing regression and classification to determine the final target position of each sub-image and the category of the sub-image;

and step S106, determining the final target position and the category of the original image according to the final target position and the category of each sub-image.

The method is based on high-low layer information fusion, utilizes an attention mechanism and multi-layer weighted fusion as final feature representation. Since the large-resolution image size is much larger than the normal image, the large-resolution image needs to be preprocessed in order to make the model feasible. If the mode of equal scaling is adopted, target information is likely to be lost, and in order to solve the problem, the invention adopts the window overlapping type to segment the image to obtain the sub-images of the original image, and the sub-images are sequentially used as the input of the model, so that the integrity of the information is effectively ensured; meanwhile, in order to avoid losing the position information of the sub-image in the original image, the invention additionally adds the position information on the characteristic representation to enhance the integrity of the spatial information of the sub-image; in addition, during feature learning, the method utilizes an attention mechanism to perform weighted fusion on high-layer information and low-layer information to obtain corresponding convolutional layer output for downstream tasks, and feature information is greatly enriched through the method, so that the final model performance is improved.

As shown in fig. 2, in one embodiment, preferably, the step S102 includes:

step S201, segmenting each original image by adopting a fixed window overlapping segmentation mode to obtain corresponding sub-images, and arranging the sub-images in sequence;

step S202, performing data preprocessing on each sub-image, and determining the position information of the sub-image in the original image, wherein the position information comprises the coordinates of the center point of the sub-image in the original image and the width and height of the sub-image.

In order to make the data information more accurate, the position information of the obtained sub-image in the original image is added, the position information comprises four coordinates of upper left, lower left, upper right and lower right, a four-dimensional vector is used for representing (x, y, w, h), wherein x and y are the coordinates of the center point of the sub-image in the original image, and w and h are the width and the height of the sub-image respectively.

wherein,

indicating a loss of the candidate target location,

which represents the parameters to be learned and,

which represents the amount of the offset of the prediction,

representing the regularization term, γ is a hyper-parameter;

l ₂= L _cls + λL _loc (2)

The above technical solution of the present invention is described in detail with a specific embodiment, as shown in fig. 3, the Attention mechanism module effectively performs weighted fusion on the low-level and high-level information, mainly extracts three layers of representations, namely, low, medium and high-level representations, and uses the low, medium and high-level representations as q, k and v in the Attention mechanism, in this way, the general weighted fusion process is simplified, that is, it is not necessary to separately calculate corresponding weights for the low, medium and high-level representations, and then fuse the weighted information into one representation. By adopting the mode, the method greatly simplifies the original operation, fully utilizes high-low layer information and enriches characteristic representation, thereby providing a better representation for downstream tasks.

As shown in fig. 4, an object detecting and recognizing apparatus for a large resolution image, the apparatus comprising:

the enhancing module 41 is configured to obtain a large-resolution image set, and perform data enhancement on the large-resolution image set to obtain an enhanced image set;

a segmentation module 42, configured to segment each original image in the enhanced image set to obtain a corresponding sub-image and position information thereof;

a processing module 43, configured to perform encoding and fusion processing on the sub-image and the position information thereof to obtain a corresponding data tensor;

a fusion module 44, configured to perform feature representation learning layer by layer on the data tensor based on a Faster R-CNN model, and fuse low-layer information, middle-layer information, and high-layer information of the Faster R-CNN model by using an attention mechanism to determine feature representations corresponding to the subimages;

a first determining module 45, configured to determine a candidate target position according to the feature representation corresponding to each sub-image, and perform regression and classification to determine a final target position of each sub-image and a category to which the final target position belongs;

and a second determining module 46, configured to determine the final target position of the original image and the category to which the final target position belongs according to the final target position of each sub-image and the category to which the final target position belongs.

As shown in fig. 5, in one embodiment, the segmentation module 42 preferably includes:

a dividing unit 51, configured to divide each original image by using a fixed-window overlapping type dividing manner to obtain corresponding sub-images, and arrange the sub-images in sequence;

the preprocessing unit 52 is configured to perform data preprocessing on each sub-image, and determine position information of the sub-image in the original image, where the position information includes coordinates of a center point of the sub-image in the original image and a width and a height of the sub-image.

In one embodiment, preferably, the fusion module 44 is configured to:

In one embodiment, preferably, the second determining module 46 is configured to:

wherein,

indicating a loss of the candidate target location,

which represents the parameters to be learned and,

which represents the amount of the offset of the prediction,

representing the regularization term, γ is a hyper-parameter;

l ₂= L _cls + λL _loc (2)

According to a fourth aspect of the embodiments of the present invention, there is provided a target detection and recognition system based on a large-resolution image, the system including:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to:

It is further understood that the term "plurality" means two or more, and other terms are analogous. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It will be further understood that the terms "first," "second," and the like are used to describe various information and that such information should not be limited by these terms. These terms are only used to distinguish one type of information from another and do not denote a particular order or importance. Indeed, the terms "first," "second," and the like are fully interchangeable. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention.

It is further to be understood that while operations are depicted in the drawings in a particular order, this is not to be understood as requiring that such operations be performed in the particular order shown or in serial order, or that all illustrated operations be performed, to achieve desirable results. In certain environments, multitasking and parallel processing may be advantageous.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A method for object detection and recognition of a high resolution image, the method comprising:

2. The method of claim 1, wherein segmenting each image in the enhanced image set to obtain a corresponding sub-image and its position information comprises:

3. The method as claimed in claim 1, wherein fusing the low-level information, the middle-level information and the high-level information of the Faster R-CNN model using an attention mechanism to determine the corresponding feature representation of the sub-image comprises:

4. The method of claim 1, wherein determining the final target position of the original image and the category to which the final target position belongs according to the final target position of each sub-image and the category to which the final target position belongs comprises:

5. The method of claim 1, wherein the candidate target locations are determined using a first calculation: