
CN114782714A - Image matching method and device based on context information fusion - Google Patents

Image matching method and device based on context information fusion

Info

Publication number
CN114782714A
Authority
CN
China
Prior art keywords
image
image block
context information
blocks
block group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210161767.7A
Other languages
Chinese (zh)
Inventor
周振
俞益洲
李一鸣
乔昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Original Assignee
Beijing Shenrui Bolian Technology Co Ltd
Shenzhen Deepwise Bolian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenrui Bolian Technology Co Ltd, Shenzhen Deepwise Bolian Technology Co Ltd filed Critical Beijing Shenrui Bolian Technology Co Ltd
Priority to CN202210161767.7A priority Critical patent/CN114782714A/en
Publication of CN114782714A publication Critical patent/CN114782714A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image matching method and device based on context information fusion. The method comprises the following steps: segmenting the image blocks to be matched from images A and B with an image segmentation network, and splicing each image block together with the processed whole image into an image block group; feeding the image block groups of images A and B into the two sub-networks of a twin network for feature extraction, and calculating the similarity between an image block group of image A and an image block group of image B from a vector composed of the extracted texture features and context information; and obtaining the matched image blocks in images A and B from these similarities using the Hungarian algorithm. Because the invention uses both the texture features and the context information of the image blocks for matching, the precision of image matching is improved.

Description

Image matching method and device based on context information fusion
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to an image matching method and device based on context information fusion.
Background
Image matching finds correspondences between a query image block and candidate image blocks based on the similarity of image blocks across images. It underpins many computer vision tasks, such as object recognition, image retrieval and wide-baseline matching. In practice, however, many factors complicate judging whether image blocks from different images are similar: different shooting angles, illumination changes, occlusion, shadows, differing camera characteristics, and so on. Traditional image block matching relies on hand-crafted feature descriptors, which cannot fully account for these factors. In recent years, with the availability of large-scale data, image block feature extraction has gradually shifted from manual design to automatic feature learning.
Most current methods that automatically extract image block features and judge similarity are based on twin (Siamese) neural networks, whose sub-networks generally have identical structures and shared weights. The typical pipeline is: define similar positive sample pairs and dissimilar negative sample pairs, design a twin convolutional neural network, feed different image blocks into different sub-networks, extract image block features through the weight-shared sub-networks, and then either compute the similarity of the extracted features with some similarity measure or predict a match with a classifier. Such methods extract only the texture features of the image blocks; compared with the whole image, an isolated image block loses its context and its relative position with respect to surrounding objects. When different instances of the same category must be matched, the available information is insufficient, and objects with similar textures but large spatial differences are easily mismatched.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides an image matching method and apparatus based on context information fusion.
In order to achieve the above object, the present invention adopts the following technical solutions.
In a first aspect, the present invention provides an image matching method based on context information fusion, including the following steps:
segmenting the image blocks to be matched from images A and B with an image segmentation network, and splicing each image block together with the processed whole image into an image block group;
feeding the image block groups of images A and B into the two sub-networks of a twin network with identical structure and shared weights, extracting texture features from the image blocks and context information from the whole images, and calculating the similarity between an image block group of image A and an image block group of image B from a vector composed of the texture features and the context information;
and obtaining the matched image blocks in images A and B by processing the similarities with the Hungarian algorithm.
Further, images A and B are a multi-view image pair, i.e., two images of the same scene taken from different angles.
Further, the method for obtaining the image block group comprises the following steps:
obtaining the segmentation mask of a target instance, and cropping along its minimum circumscribed rectangle to obtain an image block containing the target instance;
setting the pixels covered by the segmentation mask in the whole image to a background pixel value, and scaling the result to the size of the image block to obtain the processed whole image;
and splicing the image block and the processed whole image along the channel dimension to obtain the image block group.
Furthermore, the twin network uses a grouped convolutional neural network for feature extraction on the image block groups: the first group of convolutions extracts texture features from the image block in the image block group, and the second group of convolutions extracts context information from the processed whole image in the image block group.
Further, the twin network is trained with a training data set consisting of positive sample pairs and negative sample pairs; for two images M and N of the same scene taken from different viewing angles, let M_{i_m, j_m} be the image block group corresponding to the j_m-th image block of the i_m-th class of object in image M, and N_{i_n, j_n} the image block group corresponding to the j_n-th image block of the i_n-th class of object in image N; when i_m = i_n and j_m = j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a positive sample pair; when i_m ≠ i_n or j_m ≠ j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a negative sample pair.
Furthermore, the twin network only calculates and outputs the similarity of the image block groups corresponding to the image blocks of the same object class, and the object classes of the image blocks are output by the image segmentation network.
In a second aspect, the present invention provides an image matching apparatus based on context information fusion, including:
the image block segmentation module is used for segmenting the image blocks to be matched from images A and B with an image segmentation network, and splicing each image block together with the processed whole image into an image block group;
the similarity calculation module is used for feeding the image block groups of images A and B into the two sub-networks of a twin network with identical structure and shared weights, extracting texture features from the image blocks and context information from the whole images, and calculating the similarity between an image block group of image A and an image block group of image B from a vector composed of the texture features and the context information;
and the image block matching module is used for obtaining the matched image blocks in images A and B by processing the similarities with the Hungarian algorithm.
Further, the method for obtaining the image block group comprises the following steps:
obtaining the segmentation mask of a target instance, and cropping along its minimum circumscribed rectangle to obtain an image block containing the target instance;
setting the pixels covered by the segmentation mask in the whole image to a background pixel value, and scaling the result to the size of the image block to obtain the processed whole image;
and splicing the image block and the processed whole image along the channel dimension to obtain the image block group.
Further, the twin network is trained with a training data set consisting of positive sample pairs and negative sample pairs; for two images M and N of the same scene taken from different viewing angles, let M_{i_m, j_m} be the image block group corresponding to the j_m-th image block of the i_m-th class of object in image M, and N_{i_n, j_n} the image block group corresponding to the j_n-th image block of the i_n-th class of object in image N; when i_m = i_n and j_m = j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a positive sample pair; when i_m ≠ i_n or j_m ≠ j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a negative sample pair.
Furthermore, the twin network only calculates and outputs the similarity of the image block groups corresponding to the image blocks of the same object class, and the object classes of the image blocks are output by the image segmentation network.
Compared with the prior art, the invention has the following beneficial effects.
The image blocks to be matched are segmented from images A and B, each image block is spliced with the processed whole image into an image block group, texture features and context information are extracted from the image block groups by a twin network, the similarity between an image block group of image A and an image block group of image B is calculated from a vector composed of the texture features and the context information, and the matched image blocks in the two images are obtained with the Hungarian algorithm, thereby realizing automatic image block matching. Because the invention uses both the texture features and the context information of the image blocks for matching, the precision of image matching is improved.
Drawings
Fig. 1 is a flowchart of an image matching method based on context information fusion according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of an image block and a processed whole image.
FIG. 3 is a schematic view of the processing flow of the method of the present invention.
Fig. 4 is a block diagram of an image matching apparatus based on context information fusion according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer and more obvious, the present invention is further described below with reference to the accompanying drawings and the detailed description. It should be apparent that the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Fig. 1 is a flowchart of an image matching method based on context information fusion according to an embodiment of the present invention, including the following steps:
step 101, segmenting the image blocks to be matched from images A and B with an image segmentation network, and splicing each image block together with the processed whole image into an image block group;
step 102, feeding the image block groups of images A and B into the two sub-networks of a twin network with identical structure and shared weights, extracting texture features from the image blocks and context information from the whole images, and calculating the similarity between an image block group of image A and an image block group of image B from a vector composed of the texture features and the context information;
step 103, obtaining the matched image blocks in images A and B by processing the similarities with the Hungarian algorithm.
The embodiment provides an image matching method based on context information fusion. The matching method finds matching image blocks between the two input images A and B. First, the image blocks to be matched are segmented from images A and B; for example, image blocks A1, A2, A3, A4 and B1, B2, B3, B4 of four computers are segmented from two images like the one shown in fig. 2. Then, a matching algorithm pairs up the image blocks of the two images, for example the pairs belonging to the same computer: (A1, B3), (A2, B1), (A3, B4), (A4, B2).
In this embodiment, step 101 obtains the image block groups to be matched. As described above, matching image blocks are to be found between the two input images, so the image blocks to be matched must first be segmented from them. An image block here is an image region containing an object of some class, such as the cup or computer in fig. 2. An image segmentation network can be used to segment image blocks from the whole input image; image segmentation networks typically employ fully convolutional neural networks. Since the image matching in this embodiment usually needs to run in real time, a YOLACT real-time instance segmentation network is adopted. In the prior art, the segmented image blocks are usually fed directly into a twin network for matching; the matching accuracy is generally unsatisfactory because only texture features can be extracted from a single image block and no context information is available. For this reason, in this embodiment each image block and a suitably processed whole image are combined into an image block group, the image block group corresponding to each image block is fed into the twin network for feature extraction, and context information is extracted from the processed whole image. There are many ways to process the whole image; this embodiment does not limit the specific processing method, and a concrete one is given in a later embodiment.
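As a hedged illustration of this step, the sketch below obtains per-instance masks and object classes from an off-the-shelf instance segmentation model. The embodiment itself uses a YOLACT network for real-time segmentation; since no YOLACT interface is specified here, torchvision's Mask R-CNN is substituted purely as a stand-in, and the confidence threshold and variable names are illustrative assumptions.

```python
# Hypothetical stand-in for the instance segmentation step (the embodiment
# uses YOLACT; torchvision's Mask R-CNN is substituted here for illustration).
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)           # placeholder RGB image tensor in [0, 1]
with torch.no_grad():
    output = model([image])[0]

keep = output["scores"] > 0.5             # illustrative confidence threshold
masks = output["masks"][keep, 0] > 0.5    # (N, H, W) boolean masks, one per instance
labels = output["labels"][keep]           # predicted object class of each instance
# Each mask/label pair later yields one image block and its image block group.
```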
In this embodiment, step 102 computes the similarity between the image block groups of image A and those of image B. A twin network consisting of two structurally identical, weight-shared sub-networks is constructed first; to achieve fast matching, a lightweight ResNet18 can be chosen as the sub-network. The twin network then extracts texture features and context information from each input image block group and splices them into a single vector. Finally, the similarity between each image block group of image A and each image block group of image B is computed from these vectors.
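For concreteness, the following is a minimal sketch of such a sub-network and the similarity computation. It assumes PyTorch, a ResNet18 backbone adapted to the 2-channel block-group input, and cosine similarity as the measurement function; it treats the sub-network as a single backbone rather than the grouped-convolution variant described later, and all names and sizes are illustrative.

```python
# Minimal twin-network sketch (illustrative): one weight-shared encoder applied
# to block groups from images A and B, followed by a cosine similarity score.
import torch
import torch.nn as nn
import torchvision

class PatchGroupEncoder(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        # Adapt the first convolution to the 2-channel block-group input
        # (image block + processed whole image stacked along the channel dimension).
        backbone.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)
        self.backbone = backbone

    def forward(self, x):            # x: (batch, 2, H, W)
        return self.backbone(x)      # (batch, feat_dim) texture + context vector

encoder = PatchGroupEncoder()               # shared weights: one encoder used for both views
group_a = torch.randn(1, 2, 64, 64)         # a block group from image A
group_b = torch.randn(1, 2, 64, 64)         # a block group from image B
similarity = torch.cosine_similarity(encoder(group_a), encoder(group_b)).item()
```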
In this embodiment, step 103 performs the image block matching, using the Hungarian algorithm. The Hungarian algorithm is a graph-theoretic algorithm for finding a maximum matching. In this embodiment, a similarity matrix is built from the image block group similarities obtained in the previous step. For simplicity, assume image A is divided into 3 image blocks and image B into 4; the similarity matrix is shown in Table 1. First, the pair of image block groups with the largest similarity is found; in Table 1 this is A_patch2 and B_patch4 with similarity 0.8 (marked ※). They are taken as a one-to-one match and removed from the matrix, as shown in Table 2. Then the pair with the next-largest remaining similarity is found, A_patch3 and B_patch1 with similarity 0.5 in Table 2, and likewise removed, as shown in Table 3. The process is repeated until no image block groups remain to be matched. The final matching result is: (A_patch2, B_patch4), (A_patch3, B_patch1), (A_patch1, B_patch3).
TABLE 1
           A_patch1   A_patch2   A_patch3
B_patch1   0.10       0.40       0.50
B_patch2   0.20       0.60       0.30
B_patch3   0.30       0.70       0.25
B_patch4   0.50       0.80 ※     0.30

TABLE 2
           A_patch1   A_patch2   A_patch3
B_patch1   0.10       ———        0.50 ※
B_patch2   0.20       ———        0.30
B_patch3   0.30       ———        0.25
B_patch4   ———        ———        ———

TABLE 3
           A_patch1   A_patch2   A_patch3
B_patch1   ———        ———        ———
B_patch2   0.20       ———        ———
B_patch3   0.30 ※     ———        ———
B_patch4   ———        ———        ———
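To make the walkthrough in Tables 1-3 concrete, here is a minimal sketch of the iterative selection it illustrates, using the similarity values copied from Table 1; the variable and function names are hypothetical. Note that this step-by-step procedure is a greedy selection; an exact maximum-weight assignment, as computed by the Hungarian algorithm, could instead be obtained with scipy.optimize.linear_sum_assignment on the negated similarity matrix.

```python
# Greedy matching sketch reproducing the Table 1-3 walkthrough (illustrative).
import numpy as np

# Rows: B_patch1..B_patch4; columns: A_patch1..A_patch3 (values from Table 1).
sim = np.array([
    [0.10, 0.40, 0.50],
    [0.20, 0.60, 0.30],
    [0.30, 0.70, 0.25],
    [0.50, 0.80, 0.30],
])

def greedy_match(sim):
    """Repeatedly take the largest remaining similarity, then remove its row and column."""
    sim = sim.copy()
    matches = []
    while np.isfinite(sim).any():
        b, a = np.unravel_index(np.argmax(sim), sim.shape)
        matches.append((f"A_patch{a + 1}", f"B_patch{b + 1}", float(sim[b, a])))
        sim[b, :] = -np.inf      # B_patch{b+1} is matched, remove its row
        sim[:, a] = -np.inf      # A_patch{a+1} is matched, remove its column
    return matches

print(greedy_match(sim))
# [('A_patch2', 'B_patch4', 0.8), ('A_patch3', 'B_patch1', 0.5), ('A_patch1', 'B_patch3', 0.3)]
```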
As an alternative embodiment, images A and B are a multi-view image pair, i.e., two images of the same scene taken from different angles.
This embodiment defines the images A and B to be matched. In this embodiment, A and B are two views of the same scene. Multi-view images are images of the same scene captured from different angles, so the appearance of the same object differs between the two views, much as the front, top and side views of an object differ, only less markedly. The image matching in this embodiment finds the image blocks corresponding to the same object in the two views.
As an alternative embodiment, the method for obtaining the image block group includes:
obtaining the segmentation mask of a target instance, and cropping along its minimum circumscribed rectangle to obtain an image block containing the target instance;
setting the pixels covered by the segmentation mask in the whole image to a background pixel value, and scaling the result to the size of the image block to obtain the processed whole image;
and splicing the image block and the processed whole image along the channel dimension to obtain the image block group.
This embodiment provides a concrete way of obtaining the image block groups. After instance segmentation of the image (e.g. the computers in fig. 2), a segmentation mask is obtained for each target instance. First, based on the segmentation mask, the minimum circumscribed rectangle of the target instance is computed and cropped out as the image block of that instance. Then, the pixels covered by the instance's segmentation mask in the whole image are set to a background pixel value, and the whole image is scaled to the size of the image block, as shown in fig. 2. Finally, the image block and the processed whole image are spliced along the channel dimension; for example, if the image block has size 1 × 32 × 32 and the processed whole image has size 1 × 32 × 32, the spliced result has size 2 × 32 × 32, i.e. an image with twice the original number of channels, which serves as the input of a twin-network sub-network.
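As a rough illustration of this construction, the sketch below builds one block group from a single-channel image and one instance mask. It uses the axis-aligned bounding box of the mask in place of the minimum circumscribed rectangle for simplicity, and OpenCV for resizing; the background value and array layout are illustrative assumptions.

```python
# Illustrative construction of one image block group from an instance mask.
import cv2
import numpy as np

def make_block_group(image, mask, background_value=0):
    """image: (H, W) array; mask: (H, W) binary mask of one target instance."""
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    # 1) image block: crop the instance along its (axis-aligned) bounding rectangle
    block = image[y0:y1, x0:x1]
    # 2) processed whole image: blank out the instance pixels, then resize to block size
    whole = image.copy()
    whole[mask > 0] = background_value
    whole = cv2.resize(whole, (block.shape[1], block.shape[0]))
    # 3) splice block and processed whole image along the channel dimension
    return np.stack([block, whole], axis=0)      # shape (2, h, w)
```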
As an optional embodiment, the twin network uses a grouped convolutional neural network for feature extraction on the image block groups: the first group of convolutions extracts texture features from the image block in the image block group, and the second group of convolutions extracts context information from the processed whole image in the image block group.
This embodiment further specifies the structure of the twin network. As described above, this embodiment extracts not only texture features but also context information. To extract both from the image block groups, the sub-networks of the twin network use grouped convolutional neural networks, i.e. a first group of convolutions and a second group of convolutions: the first group extracts texture features from the image block in the image block group, and the second group extracts context information from the processed whole image in the image block group.
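A minimal sketch of how a grouped convolution keeps the two inputs separate is shown below (PyTorch; the channel counts are illustrative assumptions): with groups=2, the first half of the output channels is computed only from the image-block channel and the second half only from the processed whole-image channel.

```python
# Grouped-convolution sketch: group 1 sees only the image block channel
# (texture), group 2 sees only the processed whole-image channel (context).
import torch
import torch.nn as nn

grouped_conv = nn.Conv2d(in_channels=2, out_channels=64,
                         kernel_size=3, padding=1, groups=2)

block_group = torch.randn(1, 2, 64, 64)        # (image block, processed whole image)
features = grouped_conv(block_group)           # (1, 64, 64, 64)
texture_features = features[:, :32]            # outputs of the first convolution group
context_features = features[:, 32:]            # outputs of the second convolution group
```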
As an alternative embodiment, the twin network is trained with a training data set consisting of positive sample pairs and negative sample pairs; for two images M and N of the same scene taken from different viewing angles, let M_{i_m, j_m} be the image block group corresponding to the j_m-th image block of the i_m-th class of object in image M, and N_{i_n, j_n} the image block group corresponding to the j_n-th image block of the i_n-th class of object in image N; when i_m = i_n and j_m = j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a positive sample pair; when i_m ≠ i_n or j_m ≠ j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a negative sample pair.
This embodiment defines the training data set used to train the twin network. The training data set consists of positive sample pairs and negative sample pairs; a sample pair is made up of two image block groups, one from each of the two different-view images M and N. Two matched image block groups form a positive sample pair, and two unmatched image block groups form a negative sample pair. Generally, the target instances in images M and N are grouped by object class (the object class of each image block is output by the image segmentation network), such as the computers and cups in fig. 2; there may be one or more target instances of the same object class. For convenience of description, M_{i_m, j_m} denotes the image block group corresponding to the j_m-th image block of the i_m-th class of object in image M, and N_{i_n, j_n} the image block group corresponding to the j_n-th image block of the i_n-th class of object in image N, with instances of the same class numbered consistently across the two views; for example, the 2nd computer in image M and the 2nd computer in image N are the same computer (i.e., they match). Under this assumption, when i_m = i_n and j_m = j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a positive sample pair, i.e., image block groups with the same class and the same sequence number form positive pairs; when i_m ≠ i_n or j_m ≠ j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a negative sample pair, i.e., image block groups whose class or sequence number differ form negative pairs. For example, the block groups corresponding to the 2nd computer in image M and the 2nd computer in image N form a positive sample pair, while the block groups corresponding to the 2nd computer in image M and the 3rd computer in image N form a negative sample pair. Positive and negative sample pairs are usually constructed by manual labeling. This embodiment only schematically describes the process of forming positive and negative sample pairs.
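A small sketch of how such positive and negative pairs could be enumerated from labelled block groups of the two views follows; the dict-of-lists data layout, the function names and the assumption of consistent instance numbering across views are all illustrative.

```python
# Illustrative enumeration of positive/negative training pairs.
# groups_m[i][j] / groups_n[i][j]: block group of the j-th instance of class i
# in view M / view N, with instance numbering consistent across the two views.
import itertools

def index_pairs(groups):
    return [(i, j) for i, blocks in groups.items() for j in range(len(blocks))]

def build_pairs(groups_m, groups_n):
    positives, negatives = [], []
    for (im, jm), (i_n, j_n) in itertools.product(index_pairs(groups_m),
                                                  index_pairs(groups_n)):
        pair = (groups_m[im][jm], groups_n[i_n][j_n])
        if im == i_n and jm == j_n:
            positives.append(pair)      # same class, same sequence number
        else:
            negatives.append(pair)      # class or sequence number differs
    return positives, negatives
```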
As an alternative embodiment, the twin network only calculates and outputs the similarity of the image block groups corresponding to the image blocks of the same object class, and the object classes of the image blocks are output by the image segmentation network.
This embodiment provides a way to improve the matching speed. Since only image blocks of the same object class can possibly match, the embodiment reduces computation by processing only the image block groups whose image blocks share the same object class; that is, the twin network computes and outputs similarities only for such pairs of block groups. This also greatly reduces the amount of computation in the subsequent Hungarian matching, so the image matching speed is increased.
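A sketch of this class gating, under the assumption that each image block carries the object class predicted by the segmentation network (names are illustrative): only same-class block group pairs are scored before the assignment step.

```python
# Illustrative class gating: compute similarities only for block groups whose
# image blocks share the same predicted object class.
def class_gated_similarities(blocks_a, blocks_b, similarity_fn):
    """blocks_a / blocks_b: lists of (object_class, block_group) tuples."""
    sims = {}
    for ia, (cls_a, group_a) in enumerate(blocks_a):
        for ib, (cls_b, group_b) in enumerate(blocks_b):
            if cls_a == cls_b:                       # same object class only
                sims[(ia, ib)] = similarity_fn(group_a, group_b)
    return sims
```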
Fig. 4 is a schematic composition diagram of an image matching apparatus based on context information fusion according to an embodiment of the present invention, where the apparatus includes:
the image block segmentation module 11 is configured to segment the image blocks to be matched from images A and B with an image segmentation network, and splice each image block together with the processed whole image into an image block group;
the similarity calculation module 12 is configured to feed the image block groups of images A and B into the two sub-networks of a twin network with identical structure and shared weights, extract texture features from the image blocks and context information from the whole images, and calculate the similarity between an image block group of image A and an image block group of image B from a vector composed of the texture features and the context information;
and the image block matching module 13 is configured to obtain the matched image blocks in images A and B by processing the similarities with the Hungarian algorithm.
The apparatus of this embodiment may be configured to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again. The same applies to the following embodiments, which are not further described.
As an alternative embodiment, the method for obtaining the image block group includes:
obtaining the segmentation mask of a target instance, and cropping along its minimum circumscribed rectangle to obtain an image block containing the target instance;
setting the pixels covered by the segmentation mask in the whole image to a background pixel value, and scaling the result to the size of the image block to obtain the processed whole image;
and splicing the image block and the processed whole image along the channel dimension to obtain the image block group.
As an alternative embodiment, the twin network is trained with a training data set consisting of positive sample pairs and negative sample pairs; for two images M and N of the same scene taken from different viewing angles, let M_{i_m, j_m} be the image block group corresponding to the j_m-th image block of the i_m-th class of object in image M, and N_{i_n, j_n} the image block group corresponding to the j_n-th image block of the i_n-th class of object in image N; when i_m = i_n and j_m = j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a positive sample pair; when i_m ≠ i_n or j_m ≠ j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a negative sample pair.
As an alternative embodiment, the twin network only calculates and outputs the similarity of the image block groups corresponding to the image blocks of the same object class, and the object classes of the image blocks are output by the image segmentation network.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. An image matching method based on context information fusion is characterized by comprising the following steps:
segmenting the image blocks to be matched from images A and B with an image segmentation network, and splicing each image block together with the processed whole image into an image block group;
feeding the image block groups of images A and B into the two sub-networks of a twin network with identical structure and shared weights, extracting texture features from the image blocks and context information from the whole images, and calculating the similarity between an image block group of image A and an image block group of image B from a vector composed of the texture features and the context information;
and obtaining the matched image blocks in images A and B by processing the similarities with the Hungarian algorithm.
2. The image matching method based on context information fusion of claim 1, wherein images A and B are a multi-view image pair, i.e., two images of the same scene taken from different angles.
3. The image matching method based on context information fusion according to claim 1, wherein the obtaining method of the image block group comprises:
obtaining the segmentation mask of a target instance, and cropping along its minimum circumscribed rectangle to obtain an image block containing the target instance;
setting the pixels covered by the segmentation mask in the whole image to a background pixel value, and scaling the result to the size of the image block to obtain the processed whole image;
and splicing the image block and the processed whole image along the channel dimension to obtain the image block group.
4. The image matching method based on context information fusion of claim 1, wherein the twin network uses a grouped convolutional neural network for feature extraction on the image block groups: the first group of convolutions extracts texture features from the image block in the image block group, and the second group of convolutions extracts context information from the processed whole image in the image block group.
5. The image matching method based on context information fusion according to claim 1, wherein the twin network is trained with a training data set consisting of positive sample pairs and negative sample pairs; for two images M and N of the same scene taken from different viewing angles, M_{i_m, j_m} is the image block group corresponding to the j_m-th image block of the i_m-th class of object in image M, and N_{i_n, j_n} is the image block group corresponding to the j_n-th image block of the i_n-th class of object in image N; when i_m = i_n and j_m = j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a positive sample pair; when i_m ≠ i_n or j_m ≠ j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a negative sample pair.
6. The image matching method based on context information fusion of claim 5, wherein the twin network only calculates and outputs the similarity of the image block groups corresponding to the image blocks of the same object class, and the object classes of the image blocks are output by the image segmentation network.
7. An image matching apparatus based on context information fusion, comprising:
the image block segmentation module is used for segmenting the image blocks to be matched from images A and B with an image segmentation network, and splicing each image block together with the processed whole image into an image block group;
the similarity calculation module is used for feeding the image block groups of images A and B into the two sub-networks of a twin network with identical structure and shared weights, extracting texture features from the image blocks and context information from the whole images, and calculating the similarity between an image block group of image A and an image block group of image B from a vector composed of the texture features and the context information;
and the image block matching module is used for obtaining the matched image blocks in images A and B by processing the similarities with the Hungarian algorithm.
8. The image matching apparatus based on context information fusion according to claim 7, wherein the method for obtaining the image block group includes:
obtaining the segmentation mask of a target instance, and cropping along its minimum circumscribed rectangle to obtain an image block containing the target instance;
setting the pixels covered by the segmentation mask in the whole image to a background pixel value, and scaling the result to the size of the image block to obtain the processed whole image;
and splicing the image block and the processed whole image along the channel dimension to obtain the image block group.
9. The context information fusion-based image matching apparatus according to claim 7, wherein the twin network is trained with a training data set consisting of positive sample pairs and negative sample pairs; for two images M and N of the same scene taken from different viewing angles, M_{i_m, j_m} is the image block group corresponding to the j_m-th image block of the i_m-th class of object in image M, and N_{i_n, j_n} is the image block group corresponding to the j_n-th image block of the i_n-th class of object in image N; when i_m = i_n and j_m = j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a positive sample pair; when i_m ≠ i_n or j_m ≠ j_n, (M_{i_m, j_m}, N_{i_n, j_n}) is a negative sample pair.
10. The context information fusion-based image matching apparatus according to claim 9, wherein the twin network calculates and outputs only similarities of image block groups corresponding to image blocks of a same object class, and the object classes of the image blocks are output by the image segmentation network.
CN202210161767.7A 2022-02-22 2022-02-22 Image matching method and device based on context information fusion Pending CN114782714A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210161767.7A CN114782714A (en) 2022-02-22 2022-02-22 Image matching method and device based on context information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210161767.7A CN114782714A (en) 2022-02-22 2022-02-22 Image matching method and device based on context information fusion

Publications (1)

Publication Number Publication Date
CN114782714A true CN114782714A (en) 2022-07-22

Family

ID=82423564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210161767.7A Pending CN114782714A (en) 2022-02-22 2022-02-22 Image matching method and device based on context information fusion

Country Status (1)

Country Link
CN (1) CN114782714A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439938A (en) * 2022-09-09 2022-12-06 湖南智警公共安全技术研究院有限公司 Anti-splitting face archive data merging processing method and system
CN115439938B (en) * 2022-09-09 2023-09-19 湖南智警公共安全技术研究院有限公司 Anti-splitting face archive data merging processing method and system
CN115497633A (en) * 2022-10-19 2022-12-20 联仁健康医疗大数据科技股份有限公司 Data processing method, device, equipment and storage medium
CN115497633B (en) * 2022-10-19 2024-01-30 联仁健康医疗大数据科技股份有限公司 Data processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112132197B (en) Model training, image processing method, device, computer equipment and storage medium
CN106778604B (en) Pedestrian re-identification method based on matching convolutional neural network
CN102708370B (en) Method and device for extracting multi-view angle image foreground target
CN111160407B (en) Deep learning target detection method and system
CN104599275B (en) The RGB-D scene understanding methods of imparametrization based on probability graph model
CN110276264B (en) Crowd density estimation method based on foreground segmentation graph
CN104134234A (en) Full-automatic three-dimensional scene construction method based on single image
CN110298227B (en) Vehicle detection method in unmanned aerial vehicle aerial image based on deep learning
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN108648161A (en) The binocular vision obstacle detection system and method for asymmetric nuclear convolutional neural networks
CN108846404B (en) Image significance detection method and device based on related constraint graph sorting
CN110276768B (en) Image segmentation method, image segmentation device, image segmentation apparatus, and medium
CN113870128B (en) Digital mural image restoration method based on depth convolution countermeasure network
CN110334584B (en) Gesture recognition method based on regional full convolution network
CN114782714A (en) Image matching method and device based on context information fusion
CN102982539A (en) Characteristic self-adaption image common segmentation method based on image complexity
CN111027581A (en) 3D target detection method and system based on learnable codes
CN110956681A (en) Portrait background automatic replacement method combining convolutional network and neighborhood similarity
CN110111346B (en) Remote sensing image semantic segmentation method based on parallax information
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
Zou et al. Microarray camera image segmentation with Faster-RCNN
CN108388901B (en) Collaborative significant target detection method based on space-semantic channel
CN111160107B (en) Dynamic region detection method based on feature matching
CN108399630B (en) Method for quickly measuring distance of target in region of interest in complex scene
CN109816710B (en) Parallax calculation method for binocular vision system with high precision and no smear

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination