CN117994623A - Image feature vector acquisition method - Google Patents
Image feature vector acquisition method
- Publication number
- CN117994623A CN202410299814.3A
- Authority
- CN
- China
- Prior art keywords
- attention
- feature
- feature vectors
- head
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an image feature vector acquisition method in the technical field of computer vision, aiming to solve the technical problems of local information loss, insufficiently rich and comprehensive image features, and low extraction precision. The method comprises the following steps: a MAMSD model is proposed to overcome the loss of local information and the limitations of hand-crafted feature extraction in traditional image retrieval, and improves information extraction efficiency through the multi-head attention mechanism and multi-scale feature fusion mechanism of a convolutional neural network; a NetVLAD layer is introduced into the MAMSD model, so that local details of the image are described more accurately and the retrieval accuracy and stability of the image feature vectors are improved; local feature descriptors are extracted with ResNet50, MobileNet_v2 and ConvNeXt_T convolutional network models, which removes the limitation of manual feature extraction and improves the model's ability to learn features autonomously.
Description
Technical Field
The invention relates to a method for acquiring an image feature vector, and belongs to the technical field of computer vision.
Background
Image retrieval techniques have become a focus of research in computer vision and visual simultaneous localization and mapping (VSLAM) systems. Image retrieval greatly improves query efficiency: given a query image, a series of algorithms retrieves similar images from a massive database and returns the most similar ones.
In studying the prior art, the inventor found that early image retrieval focused mainly on extracting global descriptors of the image. However, such methods struggle under illumination change, deformation, occlusion, cropping and similar conditions: local information is lost, retrieval precision is low, and the application range of global descriptor algorithms is limited. To cover what global descriptors cannot, local feature descriptor extraction methods such as the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF) were developed. Although local feature extraction algorithms are invariant to rotation, scale and brightness and are fairly stable under viewpoint change, affine transformation and noise, they have poor real-time performance, sometimes extract too few feature points, and have difficulty accurately extracting feature points on targets with smooth edges. Furthermore, these descriptors are designed by hand and cannot be learned from the input.
Therefore, the inventor found that existing global feature extraction methods cause loss of local information and yield image features that are neither rich nor comprehensive, while existing local feature extraction methods rely on manually designed features and have low extraction precision.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an image feature vector acquisition method that extracts image features comprehensively, learns descriptors from the input, and achieves high extraction precision.
In order to achieve the above purpose, the present disclosure is implemented by adopting the following technical schemes:
in a first aspect, the present disclosure provides a method for obtaining an image feature vector, including,
Inputting the image dataset into ResNet50, MobileNet_v2 and ConvNeXt_T convolutional network models, respectively, and obtaining local feature descriptors from the corresponding convolutional network models, respectively;
Inputting each local feature descriptor into the multi-head attention network of the MAMSD model, obtaining the feature vector of each attention head, and fusing them to obtain a feature vector fused with all attention heads;
Inputting each local feature descriptor into the multi-scale fusion network of the MAMSD model to obtain multi-scale small modules, inputting the multi-scale small modules into the NetVLAD layer to obtain feature vectors of each scale, and fusing the feature vectors of each scale to obtain a feature vector fused with all scales;
And fusing the feature vector fused with all attention heads and the feature vector fused with all scales again to obtain a feature map, and obtaining the image feature vector according to the feature map.
In some embodiments of the first aspect, the local feature descriptors are respectively input to a multi-head attention network of the MAMSD model, feature vectors of each attention head are obtained, and fusion is performed to obtain feature vectors fused with all attention heads, including,
Transforming the input local feature descriptors into d_model-dimensional feature vectors by linear transformation;
Dividing the feature vector of d_model dimensions into a plurality of sub-vectors equal in number to the attention heads;
Multiplying each sub-vector by the query matrix, the key matrix and the value matrix to obtain a query, a key and a value;
calculating a score for each attention head according to a softmax function with the query, the key and the value as calculation parameters;
And fusing all the attention heads according to the scores to obtain feature vectors fused with all the attention heads.
In some embodiments of the first aspect, the computing a score for each attention head according to a softmax function using the query, the key, and the value as computing parameters, including,
The calculation formula of the score of each attention head is as follows:
head_i = Attention(q_i, k_i, v_i) = softmax(q_i·k_i^T/√d_model)·v_i;
Where i is the serial number of the attention head, h is the number of attention heads, head_i is the score of the i-th attention head, q_i is the query vector of the i-th attention head, k_i is the key vector of the i-th attention head, v_i is the value vector of the i-th attention head, d_model is the dimension of the feature vector, Attention is the attention function, and T denotes the transpose of vector k_i.
Fusing all attention heads according to the scores to obtain the feature vector fused with all attention heads includes,
The output formula of the feature vector fused with all attention heads is as follows:
MultiHead = Concat(head_1, head_2, ..., head_h);
Where Concat is the concatenation function and MultiHead is the multi-head attention mechanism function.
In some embodiments of the first aspect, the local feature descriptors are respectively input to a multiscale fusion network of MAMSD models to obtain multiscale small modules, the multiscale small modules are input to NetVLAD layers to obtain feature vectors of each scale, and the feature vectors of each scale are fused to obtain feature vectors fused with all scales, including,
The small modules are denoted {M_j, x_j, y_j}, where the small module size is d_x × d_y, j is the serial number of the small module, n_m is the total number of small modules, M_j is the j-th small module and (x_j, y_j) are its center coordinates. The calculation formula of the total number of small modules n_m is as follows:
n_m = ⌊(H − d_x)/s_m + 1⌋ × ⌊(W − d_y)/s_m + 1⌋;
Where s_m is the set stride, × is the multiplication symbol, and H and W are the two spatial dimensions of the local feature descriptor.
In some embodiments of the first aspect, the NetVLAD layer is constructed from the VLAD algorithm, and the method of obtaining the NetVLAD layer comprises:
Obtaining the VLAD algorithm output vector;
Making the assignment parameters in the VLAD algorithm differentiable using a softmax layer, so that the resulting NetVLAD algorithm serves as the NetVLAD layer.
In some embodiments of the first aspect, the fusing of feature vectors for each scale includes fusing feature vectors from different scales using a cascading operation based on a selected fusion method, the fusion formula being:
F=Concat(f1,f2,...,fn);
Wherein F is the feature vector fused with all scales, Concat is the concatenation function, f_1, f_2, ..., f_n are the feature vectors of scales 1 to n, respectively, and n is the number of feature vectors of different scales.
In some embodiments of the first aspect, obtaining the image feature vector according to the feature map includes regularizing the obtained feature map to obtain the image feature vector, where the regularization is selected from Intra regularization and L2 regularization.
In a second aspect, the present disclosure further provides an electronic terminal, including a processor and a memory connected to the processor, in which a computer program is stored, which when executed by the processor, performs the steps of the method for obtaining an image feature vector according to any one of the embodiments of the first aspect.
In a third aspect, the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method for acquiring an image feature vector according to any one of the embodiments of the first aspect.
In a fourth aspect, the present disclosure also provides a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of obtaining an image feature vector according to any of the embodiments of the first aspect.
Compared with the prior art, the present disclosure achieves the following beneficial effects:
The image feature vector acquisition method proposes a MAMSD model to solve the loss of local information and the limitations of manual feature extraction in traditional image retrieval, and improves information extraction efficiency through the multi-head attention mechanism and multi-scale feature fusion mechanism of a convolutional neural network; a NetVLAD layer is introduced into the MAMSD model, so that local details of the image are described more accurately, more comprehensive and richer image features are extracted, and the retrieval accuracy and stability of image feature vectors are further improved; local feature descriptors are extracted with ResNet50, MobileNet_v2 and ConvNeXt_T convolutional network models, which removes the limitation of manual feature extraction and improves the model's ability to learn features autonomously.
Drawings
In order to more clearly illustrate the technical solutions of the present disclosure or the prior art, the following description will briefly introduce the drawings that are required to be used in the embodiments or the prior art descriptions, it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a flowchart of a method for acquiring an image feature vector according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of ResNet convolutional network model in the method for acquiring an image feature vector according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of a MobileNet _v2 convolutional network model in an image feature vector acquisition method according to an embodiment of the present disclosure;
Fig. 4 is a schematic diagram of a ConvNeXt _t convolutional network model in an image feature vector acquisition method according to an embodiment of the present disclosure;
Fig. 5 is a schematic diagram of a MAMSD model in an image feature vector acquisition method according to an embodiment of the present disclosure.
Detailed Description
The following detailed description of the technical solutions of the present application will be given by way of the accompanying drawings and specific embodiments, and it should be understood that the specific features of the embodiments and embodiments of the present application are detailed descriptions of the technical solutions of the present application, and not limiting the technical solutions of the present application, and that the embodiments and technical features of the embodiments of the present application may be combined with each other without conflict.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
Example 1
Fig. 1 is a flowchart of a method for acquiring an image feature vector according to a first embodiment of the present invention. The flow chart merely shows the logical sequence of the method according to the present embodiment, and the steps shown or described may be performed in a different order than shown in fig. 1 in other possible embodiments of the invention without mutual conflict.
The method for acquiring the image feature vector provided in the embodiment may be applied to a terminal, and may be executed by an electronic device, where the device may be implemented by software and/or hardware, and the device may be integrated in the terminal, for example: any smart phone, tablet computer or computer device with communication function. Referring to fig. 1, the method of the present embodiment specifically includes the following steps:
Inputting the image dataset into ResNet50, MobileNet_v2 and ConvNeXt_T convolutional network models, respectively, and extracting local feature descriptors from the three convolutional network models respectively; the local feature descriptors are then input to the multi-head attention network and the multi-scale fusion network of the MAMSD model.
Referring to fig. 2, ResNet solves the gradient vanishing and representation bottleneck problems in deep neural networks by introducing residual blocks. ResNet50 comprises 50 layers; here the average pooling layer and fully connected layer of ResNet50 are removed, and the network generally has good performance and generalization capability. Referring to fig. 3, MobileNet_v2 retains the first convolution module and all inverted residual layers of the original model while pruning the last two convolution layers and the final max-pooling layer; MobileNet_v2 uses depthwise separable convolutions and linear bottlenecks to reduce computation and model size while maintaining performance. Referring to fig. 4, ConvNeXt is a modern network architecture based on conventional convolution, and ConvNeXt_T is a small version of this series of models, suitable for resource-limited situations; ConvNeXt_T retains the first convolution layer and all ConvNeXt modules of the original ConvNeXt model while pruning the final max-pooling layer and layer normalization. The ResNet50, MobileNet_v2 and ConvNeXt_T convolutional network models can effectively represent image content: they extract local features containing both low-level information and high-level semantic features, support end-to-end autonomous learning, require few computing resources for local descriptor extraction, and satisfy the autonomous extraction of local features.
And acquiring the characteristic vector of each attention head through a multi-head attention network, and fusing the characteristic vectors of all the attention heads. The multi-headed attention mechanism allows the model to focus on information from different representation subspaces simultaneously, focusing each attention head on a different subspace of the input to capture multiple relationships. Each attention head can learn different feature representations independently, which increases the diversity and richness of information. Thus, obtaining feature vectors from each of the attention heads may provide more comprehensive information, helping the model to better understand and analyze the image. Feature vectors fusing all attention heads can integrate feature information learned by different heads together to form a more comprehensive feature representation. This integration helps the model capture multiple features in the image and combine them together to form a more powerful feature representation. Such feature representations are generally more discriminative and robust, helping to improve the performance of the model in various visual tasks.
The multi-scale small module is obtained through a multi-scale fusion network, the feature vector of each scale is obtained according to NetVLAD layers, the feature vectors of all scales are fused, and the multiple features of different scales are fused to improve the representation and generalization capability of the model. Small scale feature vectors may be more focused on detail and texture, while large scale feature vectors may be more focused on overall structure and layout. By fusing these feature vectors of different scales, a more comprehensive, representative feature representation can be generated that combines the advantages of different scales, improving the expressive power of the features. Fusing feature vectors of multiple scales can also reduce the risk of model overfitting. Overfitting refers to the situation where the model performs well on training data, but does not perform well on test data. By fusing the features of multiple scales, the model can learn more information from more extensive data, thereby reducing the likelihood of overfitting.
The feature vector fused with all attention heads and the feature vector fused with all scales are then fused again to obtain a feature map, and the image feature vector is obtained from the feature map. The feature vectors generated by the attention heads and the feature vectors of different scales each capture a different aspect of the image: the attention heads focus on different parts of the image, while the multi-scale features reflect information at different scales. Fusing feature vectors from these different sources produces a more comprehensive and richer image representation containing complementary information, so the model can make decisions with more information and improve its performance. This fusion strategy allows the model to take more contextual information and detail into account when processing complex tasks, thereby making more accurate predictions.
In summary, the method for obtaining the image feature vector provided by this embodiment proposes a MAMSD model to solve the loss of local information and the limitations of manual feature extraction in traditional image retrieval, and improves information extraction efficiency through the multi-head attention mechanism and multi-scale feature fusion mechanism of a convolutional neural network; a NetVLAD layer is introduced into the MAMSD model, so that local details of the image are described more accurately, more comprehensive and richer image features are extracted, and the retrieval accuracy and stability of image feature vectors are further improved; local feature descriptors are extracted with ResNet50, MobileNet_v2 and ConvNeXt_T convolutional network models, which removes the limitation of manual feature extraction and improves the model's ability to learn features autonomously.
Example two
The present embodiment provides an image feature vector obtaining method, which is improved based on the first embodiment, and reference may be made to the steps shown in fig. 1, which is not described in detail in the first embodiment.
The method for obtaining the image feature vector in the embodiment is further refined:
First, regarding the extraction of local feature descriptors from the three convolutional network models, and referring to figs. 2, 3 and 4, assume that the dimensions (C, H, W) of the local feature descriptors extracted from a certain image dataset by the ResNet50, MobileNet_v2 and ConvNeXt_T convolutional network models are (2048, 7, 7), (320, 7, 7) and (7, 7, 768), respectively, where C, H and W are the three dimensions of a local feature descriptor.
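As a minimal illustration (not part of the patent), the sketch below shows how truncated torchvision backbones could produce local feature maps with roughly the dimensions stated above; the exact truncation points, the use of torchvision, and the channels-last reading of the (7, 7, 768) ConvNeXt_T shape are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

resnet50 = models.resnet50(weights=None)
resnet_backbone = nn.Sequential(*list(resnet50.children())[:-2])  # drop avgpool + fc -> (2048, 7, 7)

mobilenet_v2 = models.mobilenet_v2(weights=None)
mobilenet_backbone = mobilenet_v2.features[:-1]                   # drop the last 1x1 conv -> (320, 7, 7)

convnext_t = models.convnext_tiny(weights=None)
convnext_backbone = convnext_t.features                           # (768, 7, 7); (7, 7, 768) above is channels-last

x = torch.randn(1, 3, 224, 224)                                   # dummy 224x224 image batch
with torch.no_grad():
    for name, net in [("ResNet50", resnet_backbone),
                      ("MobileNet_v2", mobilenet_backbone),
                      ("ConvNeXt_T", convnext_backbone)]:
        net.eval()
        print(name, tuple(net(x).shape)[1:])                      # local feature descriptor dims (C, H, W)
```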
Referring to fig. 5, the three local feature descriptors are input to the MAMSD model. In one embodiment, the feature vector of each attention head is acquired through the multi-head attention network and the feature vectors of all attention heads are fused; in a further preferred but non-limiting embodiment of the present application, this includes at least the following steps. For ease of processing, the input local feature descriptors are transformed into d_model-dimensional feature vectors by linear transformation; specifically, within the multi-head attention network a local feature descriptor, taken as an input vector X ∈ R^(N×D), is transformed into a d_model-dimensional feature vector Z, with the calculation formula:
Z = X·W_Q;
Wherein W_Q is a parameter matrix, R is the real number field, N is the number of descriptors, and D is the descriptor dimension.
Dividing the d_model-dimensional feature vector into a number of sub-vectors equal to the number of attention heads; specifically, the feature vector Z of d_model dimensions is divided into h equal parts to obtain h sub-vectors Z_i, where i is the serial number of the attention head.
Each sub-vector is multiplied by the query matrix, the key matrix and the value matrix to obtain a query, a key and a value; specifically, for each attention head, three d_model-dimensional vectors are obtained by multiplying the sub-vector Z_i by three parameter matrices, and the calculation formulas of the query vector, the key vector and the value vector are as follows:
q_i = Z_i·W_i^Q, k_i = Z_i·W_i^K, v_i = Z_i·W_i^V;
Wherein W_i^Q is the query matrix, W_i^K is the key matrix, W_i^V is the value matrix, q_i is the query vector, k_i is the key vector, and v_i is the value vector.
Calculating a score for each attention head using a softmax function with the query, the key and the value as calculation parameters; specifically, for each attention head, the attention score is obtained by taking the dot product of the query vector and the key vector and then applying the softmax function. The calculation formula of the score of each attention head is as follows:
head_i = Attention(q_i, k_i, v_i) = softmax(q_i·k_i^T/√d_model)·v_i;
Where i is the serial number of the attention head, h is the number of attention heads, head_i is the score of the i-th attention head, q_i is the query vector of the i-th attention head, k_i is the key vector of the i-th attention head, v_i is the value vector of the i-th attention head, d_model is the dimension of the feature vector, Attention is the attention function, and T denotes the transpose of vector k_i.
Fusing all attention heads according to the scores to obtain the feature vector fused with all attention heads; specifically, the output formula of the feature vector fusing all attention heads is as follows:
MultiHead = Concat(head_1, head_2, ..., head_h);
Where Concat is the concatenation function and MultiHead is the multi-head attention mechanism function.
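The following is a minimal sketch of the multi-head attention branch described above, assuming d_model = 512, h = 8 heads, a single shared query/key/value matrix applied to each sub-vector, and a √d_head scaling inside the softmax (the formula above uses the feature dimension); the class and parameter names are illustrative only, not taken from the patent.

```python
import math
import torch
import torch.nn as nn

class MultiHeadFusion(nn.Module):
    """Sketch of the multi-head attention branch; names and sizes are assumptions."""
    def __init__(self, descriptor_dim: int, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.proj = nn.Linear(descriptor_dim, d_model)              # X -> d_model-dimensional vectors Z
        self.wq = nn.Linear(self.d_head, self.d_head, bias=False)   # query matrix applied to each sub-vector Z_i
        self.wk = nn.Linear(self.d_head, self.d_head, bias=False)   # key matrix applied to each sub-vector Z_i
        self.wv = nn.Linear(self.d_head, self.d_head, bias=False)   # value matrix applied to each sub-vector Z_i

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, D) local feature descriptors
        z = self.proj(x).view(x.size(0), self.h, self.d_head)       # split Z into h sub-vectors Z_i
        q, k, v = self.wq(z), self.wk(z), self.wv(z)                 # per-head queries, keys, values
        scores = torch.einsum("nhd,mhd->hnm", q, k) / math.sqrt(self.d_head)
        heads = torch.einsum("hnm,mhd->nhd", scores.softmax(dim=-1), v)
        return heads.reshape(heads.size(0), -1)                      # Concat(head_1, ..., head_h)

feats = torch.randn(49, 2048)                                        # e.g. flattened 7x7 ResNet50 descriptors
print(MultiHeadFusion(2048)(feats).shape)                             # torch.Size([49, 512])
```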
In a further preferred but non-limiting embodiment of the present application, the step further comprises, in the multi-scale fusion network, taking the local feature descriptor as the input feature map F ∈ R^(D×H×W), where H and W are the two spatial dimensions of the local feature descriptor, D is the descriptor dimension and R is the real number field; setting the stride to s_m, small modules of size d_x × d_y are obtained and denoted {M_j, x_j, y_j}. The calculation formula of the total number n_m of the small modules is:
n_m = ⌊(H − d_x)/s_m + 1⌋ × ⌊(W − d_y)/s_m + 1⌋.
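A minimal sketch of the small-module extraction under the sliding-window reading above; the function name, the example module size d_x = d_y = 3 and the stride s_m = 1 are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def extract_small_modules(feat: torch.Tensor, dx: int, dy: int, stride: int) -> torch.Tensor:
    """Cut a (D, H, W) feature map into dx x dy small modules with the given stride (a sketch)."""
    D, H, W = feat.shape
    # unfold returns one column of length D*dx*dy per sliding-window position
    cols = F.unfold(feat.unsqueeze(0), kernel_size=(dx, dy), stride=stride)   # (1, D*dx*dy, n_m)
    n_m = cols.size(-1)
    modules = cols.squeeze(0).transpose(0, 1).reshape(n_m, D, dx, dy)
    # standard sliding-window count, matching unfold's behaviour
    assert n_m == ((H - dx) // stride + 1) * ((W - dy) // stride + 1)
    return modules

feat = torch.randn(2048, 7, 7)                 # e.g. a ResNet50 local feature map
mods = extract_small_modules(feat, dx=3, dy=3, stride=1)
print(mods.shape)                              # torch.Size([25, 2048, 3, 3]) -> n_m = 25
```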
The first embodiment also mentions using the NetVLAD layer to obtain the feature vector of each scale. Since the assignment parameters of the output vector in the conventional VLAD algorithm are hard assignments, in order to embed the VLAD algorithm as a NetVLAD layer in the convolutional neural network model for end-to-end training, the NetVLAD algorithm uses a softmax layer to make the assignment parameters differentiable and thereby achieve soft assignment.
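A minimal sketch of a NetVLAD-style layer with softmax soft assignment, assuming 64 clusters, a 1×1 convolution for the assignment logits, and intra-/L2-normalization inside the layer; these choices are illustrative and not specified by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLADLayer(nn.Module):
    """Sketch of a NetVLAD layer: softmax soft assignment instead of VLAD's hard assignment."""
    def __init__(self, dim: int, num_clusters: int = 64):
        super().__init__()
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)    # assignment logits per descriptor
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D, H, W) local descriptors on a spatial grid
        B, D, H, W = x.shape
        a = F.softmax(self.assign(x).flatten(2), dim=1)               # (B, K, H*W) soft assignments
        x = x.flatten(2)                                              # (B, D, H*W)
        # residuals of each descriptor to each centroid, weighted by the soft assignment
        vlad = torch.einsum("bkn,bdn->bkd", a, x) - a.sum(-1).unsqueeze(-1) * self.centroids
        vlad = F.normalize(vlad, dim=2)                               # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), dim=1)                    # (B, K*D) L2-normalized vector

v = NetVLADLayer(dim=2048)(torch.randn(1, 2048, 7, 7))
print(v.shape)                                                        # torch.Size([1, 131072])
```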
After the acquisition of the feature vectors fused with all the attention heads and the feature vectors fused with all the scales is completed, the feature vectors fused with all the attention heads and the feature vectors fused with all the scales are fused again, wherein the fusion formula is as follows:
F=Concat(f1,f2,...,fn);
Wherein F is the feature vector fused with all scales, Concat is the concatenation function, f_1, f_2, ..., f_n are the feature vectors of scales 1 to n, respectively, and n is the number of feature vectors of different scales.
In order to improve robustness, generalization capability and stability and to avoid overfitting, the obtained feature map is regularized to obtain the image feature vector; the regularization may be Intra regularization or L2 regularization.
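A minimal sketch of this final fusion and regularization step, assuming placeholder branch dimensionalities and a simple grouped form of Intra regularization; only the Concat-then-normalize structure follows the description above.

```python
import torch
import torch.nn.functional as F

def fuse_and_regularize(attention_feat: torch.Tensor,
                        scale_feats: list,
                        use_intra: bool = False) -> torch.Tensor:
    """Sketch: fuse the multi-scale and multi-head branches, then regularize (names assumed)."""
    multiscale = torch.cat(scale_feats, dim=1)               # F = Concat(f1, ..., fn)
    fused = torch.cat([attention_feat, multiscale], dim=1)   # fuse attention + multi-scale branches
    if use_intra:
        # assumed Intra regularization: normalize fixed-size groups before the global L2 step
        fused = F.normalize(fused.view(fused.size(0), -1, 128), dim=2).flatten(1)
    return F.normalize(fused, dim=1)                          # L2 regularization -> image feature vector

# placeholder branch outputs: one attention vector and three scale vectors per image
vec = fuse_and_regularize(torch.randn(1, 512),
                          [torch.randn(1, 1024), torch.randn(1, 1024), torch.randn(1, 512)])
print(vec.shape)                                              # torch.Size([1, 3072])
```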
In a recall comparison experiment, the MAMSD model combination proposed in this embodiment increases the Recall@25 value by 17.6% compared with the traditional VLAD method, by 2.62% compared with the traditional NetVLAD method, and by 1.56% compared with the Patch-NetVLAD method. The image feature vector acquisition method provided by this embodiment therefore improves the accuracy and stability of image retrieval; because a NetVLAD layer is involved, local details of the image can be described more accurately, and road conditions can be resolved more accurately when the method is combined with AR technology. The proposed MAMSD model further improves the accuracy of image feature extraction and matching.
Example III
This embodiment provides an electronic terminal, including a processor and a memory connected to the processor, where a computer program is stored in the memory; when the computer program is executed by the processor, referring to fig. 1, the steps of the method for acquiring an image feature vector according to Embodiment 1 or Embodiment 2 may be performed.
This embodiment also provides a computer-readable storage medium on which a computer program is stored; referring to fig. 1, the program, when executed by a processor, implements the steps of the image feature vector acquisition method of Embodiment 1 or Embodiment 2.
This embodiment also provides a computer program product comprising computer programs/instructions; referring to fig. 1, the computer programs/instructions, when executed by a processor, implement the steps of the image feature vector acquisition method according to Embodiment 1 or Embodiment 2.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.
Claims (10)
1. A method for obtaining an image feature vector, characterized by comprising:
Inputting the image dataset into ResNet50, MobileNet_v2 and ConvNeXt_T convolutional network models, respectively, and obtaining local feature descriptors from the corresponding convolutional network models, respectively;
Inputting each local feature descriptor into a multi-head attention network of MAMSD model respectively, obtaining feature vectors of each attention head, and fusing to obtain feature vectors fused with all attention heads;
Inputting each local feature descriptor into a multi-scale fusion network of MAMSD models respectively to obtain multi-scale small modules, inputting the multi-scale small modules into NetVLAD layers to obtain feature vectors of each scale, and fusing the feature vectors of each scale to obtain feature vectors fused with all scales;
And fusing the feature vectors of all the attention heads and the feature vectors fused with all the scales again to obtain a feature map, and obtaining the image feature vector according to the feature map.
2. The method for obtaining image feature vectors according to claim 1, wherein each local feature descriptor is respectively input to a multi-head attention network of MAMSD model, feature vectors of each attention head are obtained, and fusion is performed to obtain feature vectors fused with all attention heads, including,
Transforming the input local feature descriptors into d_model-dimensional feature vectors by linear transformation;
Dividing the feature vector of d_model dimensions into a plurality of sub-vectors equal in number to the attention heads;
Multiplying each sub-vector by the query matrix, the key matrix and the value matrix to obtain a query, a key and a value;
calculating a score for each attention head according to a softmax function with the query, the key and the value as calculation parameters;
And fusing all the attention heads according to the scores to obtain feature vectors fused with all the attention heads.
3. The method for acquiring the image feature vector according to claim 2, wherein the calculating the score of each attention head according to the softmax function using the query, the key and the value as the calculation parameters includes,
The calculation formula of the score of each attention head is as follows:
head_i = Attention(q_i, k_i, v_i) = softmax(q_i·k_i^T/√d_model)·v_i;
Where i is the serial number of the attention head, h is the number of attention heads, head_i is the score of the i-th attention head, q_i is the query vector of the i-th attention head, k_i is the key vector of the i-th attention head, v_i is the value vector of the i-th attention head, d_model is the dimension of the feature vector, Attention is the attention function, and T denotes the transpose of vector k_i;
The fusion of all attention heads according to the scores, the feature vectors fused with all attention heads are obtained, including,
The output formula of the feature vector fused with all the attention heads is as follows:
MultiHead = Concat(head_1, head_2, ..., head_h);
Where Concat is the concatenation function and MultiHead is the multi-head attention mechanism function.
4. The method for obtaining the image feature vector according to claim 1, wherein each local feature descriptor is respectively input into a multi-scale fusion network of MAMSD models to obtain multi-scale small modules, the multi-scale small modules are input into NetVLAD layers to obtain feature vectors of each scale, the feature vectors of each scale are fused to obtain feature vectors fused with all scales, and the method comprises the steps of,
The small modules are denoted {M_j, x_j, y_j}, where the small module size is d_x × d_y, j is the serial number of the small module, n_m is the total number of small modules, M_j is the j-th small module and (x_j, y_j) are its center coordinates; the calculation formula of the total number of small modules n_m is as follows:
n_m = ⌊(H − d_x)/s_m + 1⌋ × ⌊(W − d_y)/s_m + 1⌋;
Where s_m is the set stride, × is the multiplication symbol, and H and W are the two spatial dimensions of the local feature descriptor.
5. The method of claim 4, wherein the NetVLAD layer is constructed from the VLAD algorithm, and the method of obtaining the NetVLAD layer comprises:
Obtaining the VLAD algorithm output vector;
Making the assignment parameters in the VLAD algorithm differentiable using a softmax layer, so that the resulting NetVLAD algorithm serves as the NetVLAD layer.
6. The method according to claim 4, wherein the fusing feature vectors of each scale includes fusing feature vectors from different scales using a cascade operation based on the selected fusing method, the fusing formula being:
F=Concat(f1,f2,...,fn);
Wherein F is the feature vector fused with all scales, Concat is the concatenation function, f_1, f_2, ..., f_n are the feature vectors of scales 1 to n, respectively, and n is the number of feature vectors of different scales.
7. The method for obtaining an image feature vector according to claim 1, wherein obtaining the image feature vector according to the feature map includes regularizing the obtained feature map to obtain the image feature vector, where the regularization is selected from Intra regularization and L2 regularization.
8. An electronic terminal comprising a processor and a memory coupled to the processor, wherein a computer program is stored in the memory, which, when executed by the processor, performs the steps of the method of obtaining an image feature vector according to any one of claims 1 to 7.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the image feature vector acquisition method according to any one of claims 1 to 7.
10. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of obtaining an image feature vector as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410299814.3A CN117994623A (en) | 2024-03-15 | 2024-03-15 | Image feature vector acquisition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410299814.3A CN117994623A (en) | 2024-03-15 | 2024-03-15 | Image feature vector acquisition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117994623A true CN117994623A (en) | 2024-05-07 |
Family
ID=90902230
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410299814.3A Pending CN117994623A (en) | 2024-03-15 | 2024-03-15 | Image feature vector acquisition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117994623A (en) |
-
2024
- 2024-03-15 CN CN202410299814.3A patent/CN117994623A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118227822A (en) * | 2024-05-27 | 2024-06-21 | 南京信息工程大学 | Image vector acquisition method for retrieval |
CN118227822B (en) * | 2024-05-27 | 2024-09-27 | 南京信息工程大学 | Image vector acquisition method for retrieval |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110866140B (en) | Image feature extraction model training method, image searching method and computer equipment | |
US11328172B2 (en) | Method for fine-grained sketch-based scene image retrieval | |
CN107229757B (en) | Video retrieval method based on deep learning and Hash coding | |
CN111858954A (en) | Task-oriented text-generated image network model | |
Jiang et al. | Hyperspectral image classification with spatial consistence using fully convolutional spatial propagation network | |
CN110222718B (en) | Image processing method and device | |
CN111339343A (en) | Image retrieval method, device, storage medium and equipment | |
Zhang et al. | Bioinspired scene classification by deep active learning with remote sensing applications | |
CN117033609B (en) | Text visual question-answering method, device, computer equipment and storage medium | |
CN110598022B (en) | Image retrieval system and method based on robust deep hash network | |
Liu et al. | Name your style: An arbitrary artist-aware image style transfer | |
CN110993037A (en) | Protein activity prediction device based on multi-view classification model | |
CN111461175A (en) | Label recommendation model construction method and device of self-attention and cooperative attention mechanism | |
Huang et al. | Fine-art painting classification via two-channel deep residual network | |
CN117994623A (en) | Image feature vector acquisition method | |
Prasomphan | Toward Fine-grained Image Retrieval with Adaptive Deep Learning for Cultural Heritage Image. | |
Xu et al. | On learning semantic representations for large-scale abstract sketches | |
CN117649582B (en) | Single-flow single-stage network target tracking method and system based on cascade attention | |
CN116597267B (en) | Image recognition method, device, computer equipment and storage medium | |
CN114429648B (en) | Pedestrian re-identification method and system based on contrast characteristics | |
Zhang et al. | Improved image retrieval algorithm of GoogLeNet neural network | |
CN110826726B (en) | Target processing method, target processing device, target processing apparatus, and medium | |
CN115098646A (en) | Multilevel relation analysis and mining method for image-text data | |
Yin et al. | Perceptually learning multi-view sparse representation for scene categorization | |
CN112560712A (en) | Behavior identification method, device and medium based on time-enhanced graph convolutional network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||