
CN111724443A - Unified scene visual positioning method based on generative adversarial network - Google Patents

Unified scene visual positioning method based on generative adversarial network

Info

Publication number
CN111724443A
CN111724443A (application CN202010517260.1A; granted publication CN111724443B)
Authority
CN
China
Prior art keywords
image
dimensional
loss
network
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010517260.1A
Other languages
Chinese (zh)
Other versions
CN111724443B (en)
Inventor
高伟
韩胜
吴毅红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202010517260.1A priority Critical patent/CN111724443B/en
Publication of CN111724443A publication Critical patent/CN111724443A/en
Application granted granted Critical
Publication of CN111724443B publication Critical patent/CN111724443B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4038 - Image mosaicing, e.g. composing plane images from plane sub-images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/10 - Segmentation; Edge detection
    • G06T 7/174 - Segmentation; Edge detection involving the use of two or more images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/30 - Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T 7/33 - Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/80 - Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of visual positioning, and particularly relates to a unified scene visual positioning method, system and device based on a generative adversarial network, aiming at solving the problems of low positioning precision and poor robustness of existing visual positioning methods. The method comprises the following steps: acquiring a query image and performing semantic segmentation to obtain a semantic tag map; splicing the semantic tag map with the query image and translating the result; extracting a global descriptor and two-dimensional local features of the translated query image, and matching the global descriptor of the translated image with the global descriptors of the images in an image library to obtain candidate images; acquiring the three-dimensional model corresponding to the candidate images, and matching the two-dimensional local features with the three-dimensional point cloud within the range determined by the candidate images in the three-dimensional model to obtain two-dimensional-three-dimensional matching point pairs; and calculating the 6-degree-of-freedom camera pose corresponding to the query image. The invention can obtain a high-precision camera pose corresponding to the query image and can realize robust visual positioning over a long time span.

Description

Unified scene visual positioning method based on generative adversarial network
Technical Field
The invention belongs to the field of visual positioning, and particularly relates to a unified scene visual positioning method, system and device based on a generative adversarial network.
Background
Visual localization is a key technology in the field of three-dimensional spatial vision; its core goal is to estimate the 6-degree-of-freedom pose of the camera in a global coordinate system. One of its major difficulties is how to handle the appearance changes between query images and database images over long time spans. Current mainstream visual localization methods focus on extracting more robust features from the images to cope with the effects of scene differences.
The existing mainstream visual positioning methods mainly fall into the following three categories:
(1) structure-based visual localization;
(2) image-based visual localization;
(3) learning-based visual localization.
Method (1) directly matches the feature points of the query image against all feature points stored in the three-dimensional model, so the amount of data involved in the computation is large, and, since only local features are considered, it is difficult to achieve good robustness to environmental changes. Method (2) can be divided into two stages: the early approach used the extracted global descriptor for retrieval and took the pose of the retrieved most similar database image as the pose of the query image; later, in order to improve positioning accuracy, image-based methods gradually evolved to first retrieve the several candidate images most similar to the query image among the images contained in the three-dimensional model, then match the feature points contained in these images, and use the resulting two-dimensional-three-dimensional matching points for pose computation. Clearly, methods (1) and (2) both rely heavily on feature extraction from images, and such features vary greatly over a long time span; especially when applied in challenging scenes where lighting, weather or seasons vary widely, higher demands are placed on the robustness of the algorithm. Method (3) attempts to regress the camera pose directly with an end-to-end network, but currently cannot reach the same level of accuracy as conventional methods. In view of this, the present invention provides a unified scene visual positioning method based on a generative adversarial network.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, the low positioning accuracy and poor robustness caused by scene changes over a long time span in conventional visual positioning methods, a first aspect of the present invention provides a unified scene visual positioning method based on a generative adversarial network, the method comprising:
step S100, acquiring a query image, and performing semantic segmentation on the query image through a semantic segmentation network to obtain a semantic tag map;
step S200, splicing the semantic label map and the query image, translating the result through a generator of a pre-trained generative adversarial network, and taking the translated image as a first image;
step S300, extracting a global descriptor and two-dimensional local features of the first image; respectively matching the global descriptor of the first image with the global descriptors of all images in a preset image library to obtain candidate images; the image library is a database stored after semantic segmentation and generator translation of the images of the scene corresponding to the query image;
step S400, acquiring a pre-constructed three-dimensional model corresponding to the candidate images; matching the two-dimensional local features with the three-dimensional point cloud within the range determined by the candidate images in the three-dimensional model to obtain two-dimensional-three-dimensional matching point pairs;
step S500, based on each two-dimensional-three-dimensional matching point pair, calculating the 6-degree-of-freedom camera pose corresponding to the query image through a PnP-RANSAC framework;
the generative adversarial network is optimized during training with a bidirectional reconstruction loss, a cycle-consistency loss and an adversarial loss; the bidirectional reconstruction loss comprises an L1 loss and an MS-SSIM loss. A high-level sketch of how steps S100-S500 fit together is given after this summary.
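The following Python sketch outlines the flow of steps S100-S500. It is a minimal illustration only: every component (seg_net, generator, netvlad, superpoint, db_model, pnp_ransac) is a placeholder standing in for the modules described above, and is not the patent's actual implementation.

```python
import numpy as np

def localize(query_rgb, seg_net, generator, netvlad, superpoint,
             db_descs, db_model, pnp_ransac, n_candidates=10):
    """Illustrative end-to-end flow of steps S100-S500 (placeholder components)."""
    # S100: semantic segmentation of the query image
    label_map = seg_net(query_rgb)
    # S200: stack the label map as a 4th channel and translate into the unified scene
    four_ch = np.concatenate([query_rgb, label_map[..., None]], axis=-1)
    first_image = generator(four_ch)
    # S300: global retrieval of candidate images + 2D local features
    g_desc = netvlad(first_image)
    keypoints, local_descs = superpoint(first_image)
    candidates = np.argsort(np.linalg.norm(db_descs - g_desc, axis=1))[:n_candidates]
    # S400: 2D-3D matching against the pre-built SfM model of the candidates
    pts2d, pts3d = db_model.match(candidates, keypoints, local_descs)
    # S500: 6-DoF pose from PnP-RANSAC
    return pnp_ransac(pts2d, pts3d)
```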
In some preferred embodiments, the generative adversarial network is trained as follows:
acquiring a training sample set; the training sample set comprises query images and database images; the database images are images of the scene corresponding to the query images;
performing semantic segmentation on the query image and the database image respectively through a semantic segmentation network, and splicing the resulting label maps with the corresponding images before segmentation; taking the spliced query image as a second image and the spliced database image as a third image;
decomposing the second image and the third image into content codes and style codes respectively with the generator of the generative adversarial network;
recombining the style codes and content codes of the second image and the third image and then decoding; and, based on the decoded images, obtaining the corresponding loss values through the discriminators of the generative adversarial network and updating the network parameters.
In some preferred embodiments, the loss value of the generative adversarial network is calculated as:

$$\min_{E_A,E_B,G_A,G_B}\ \max_{D_A,D_B}\ \mathcal{L}(E_A,E_B,G_A,G_B,D_A,D_B)=\mathcal{L}_{GAN}^{x_A}+\mathcal{L}_{GAN}^{x_B}+\lambda_{cyc}\mathcal{L}_{cyc}+\lambda_x\big(\mathcal{L}_{recon}^{x_A}+\mathcal{L}_{recon}^{x_B}\big)+\lambda_c\big(\mathcal{L}_{recon}^{c_A}+\mathcal{L}_{recon}^{c_B}\big)+\lambda_s\big(\mathcal{L}_{recon}^{s_A}+\mathcal{L}_{recon}^{s_B}\big)$$

wherein $E_A$ is the encoder of the $X_A$ domain and $E_B$ the encoder of the $X_B$ domain; $G_A$ is the decoder of the $X_A$ domain and $G_B$ the decoder of the $X_B$ domain; $D_A$ is the discriminator that tries to distinguish translated images from real images in $X_A$, and $D_B$ the discriminator that tries to distinguish translated images from real images in $X_B$; $\mathcal{L}_{GAN}^{x_A}$ and $\mathcal{L}_{GAN}^{x_B}$ denote the adversarial loss values of the network in the $X_A$ and $X_B$ domains; $\mathcal{L}_{cyc}$ denotes the reconstruction loss, relative to the original images, of images in the $X_A$ and $X_B$ domains after being translated into the opposite domain and then translated back to the original domain; $\mathcal{L}_{recon}^{x_A}$ and $\mathcal{L}_{recon}^{x_B}$ denote the reconstruction loss of the images $x_A$ and $x_B$ relative to the original images after an encoding-decoding operation; $\mathcal{L}_{recon}^{c_A}$ and $\mathcal{L}_{recon}^{c_B}$ denote the reconstruction loss of the content codes $c_A$ and $c_B$ relative to the original content codes after a decoding-encoding operation; $\mathcal{L}_{recon}^{s_A}$ and $\mathcal{L}_{recon}^{s_B}$ denote the reconstruction loss of the style codes $s_A$ and $s_B$ relative to the original style codes after a decoding-encoding operation; $\lambda_{cyc}$, $\lambda_x$, $\lambda_c$, $\lambda_s$ are the weights of the corresponding loss terms; $X_A$ denotes the scene of the query images in the generator and $X_B$ the scene of the database images in the generator.
In some preferred embodiments, the two-dimensional local features are extracted by SuperPoint.
In some preferred embodiments, in step S300, "match the global descriptor of the first image with the global descriptors of the images in a preset image library respectively to obtain candidate images", the method includes:
extracting global descriptors of the first image and of the images in a preset image library through NetVLAD;
respectively calculating the L2 distance between the global descriptor of the first image and the global descriptor of each image in a preset image library;
and taking the images of the image library corresponding to the N minimum L2 distances as candidate images, wherein N is a positive integer.
In some preferred embodiments, the three-dimensional model is constructed by:
extracting local features of each image in the image library through SuperPoint;
based on the extracted local features, performing camera pose calibration and generating a sparse point cloud through structure-from-motion (SfM);
and constructing a three-dimensional model based on the camera pose and the sparse point cloud.
In a second aspect, the invention provides a unified scene visual positioning system based on a generative adversarial network, which comprises a semantic segmentation module, a translation module, a descriptor matching module, a point pair acquisition module and a camera pose calculation module;
the semantic segmentation module is configured to acquire a query image and perform semantic segmentation on the query image through a semantic segmentation network to obtain a semantic tag map;
the translation module is configured to splice the semantic tag map and the query image, translate the result through a generator of a pre-trained generative adversarial network, and take the translated image as a first image;
the descriptor matching module is configured to extract a global descriptor and two-dimensional local features of the first image; respectively matching the global descriptor of the first image with the global descriptors of all images in a preset image library to obtain candidate images; the image library is a database stored after semantic segmentation and generator translation of the image of the scene corresponding to the query image;
the point pair obtaining module is configured to obtain a pre-constructed three-dimensional model corresponding to the candidate image; matching the two-dimensional local features with three-dimensional point clouds in a candidate image determination range in the three-dimensional model to obtain two-dimensional-three-dimensional matching point pairs;
the camera pose calculation module is configured to calculate the 6-degree-of-freedom camera pose corresponding to the query image through a PnP-RANSAC framework based on each two-dimensional-three-dimensional matching point pair;
the generative adversarial network is optimized during training with a bidirectional reconstruction loss, a cycle-consistency loss and an adversarial loss; the bidirectional reconstruction loss comprises an L1 loss and an MS-SSIM loss.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above unified scene visual positioning method based on a generative adversarial network.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above unified scene visual positioning method based on a generative adversarial network.
The invention has the beneficial effects that:
The invention unifies the query-image scene and the database-image scene into a standard scene; through the two-step processing of image retrieval and local feature matching, a high-precision camera pose corresponding to the query image can be obtained, and robust visual positioning over a long time span can be realized. By cross-domain translation of the query images and database images, images of different scenes are unified into a standard scene, which effectively overcomes the challenges brought to visual positioning by scene changes such as illumination, weather and seasons over a long time span and thereby enhances robustness. Meanwhile, through the coarse-to-fine hierarchical positioning process that combines global image retrieval with local feature matching, the high-precision 6-degree-of-freedom camera pose corresponding to the query image can be obtained, realizing high-precision pose computation after the scenes have been unified. On the test dataset, the method achieves a higher recall under the positioning thresholds.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a flowchart of the unified scene visual positioning method based on a generative adversarial network according to an embodiment of the present invention;
FIG. 2 is a block diagram of the unified scene visual positioning system based on a generative adversarial network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the training framework of the generator of the generative adversarial network according to an embodiment of the present invention;
FIG. 4 shows images before and after translation by the generator of the generative adversarial network according to an embodiment of the present invention;
FIG. 5 is a detailed flowchart of the unified scene visual positioning method based on a generative adversarial network according to an embodiment of the present invention;
FIG. 6 is a schematic block diagram of a computer system suitable for use with the electronic device to implement an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The unified scene visual positioning method based on the generative adversarial network, as shown in fig. 1, comprises the following steps:
step S100, acquiring a query image, and performing semantic segmentation on the query image through a semantic segmentation network to obtain a semantic tag map;
step S200, splicing the semantic label map and the query image, translating the result through a generator of a pre-trained generative adversarial network, and taking the translated image as a first image;
step S300, extracting a global descriptor and two-dimensional local features of the first image; respectively matching the global descriptor of the first image with the global descriptors of all images in a preset image library to obtain candidate images; the image library is a database stored after semantic segmentation and generator translation of the images of the scene corresponding to the query image;
step S400, acquiring a pre-constructed three-dimensional model corresponding to the candidate images; matching the two-dimensional local features with the three-dimensional point cloud within the range determined by the candidate images in the three-dimensional model to obtain two-dimensional-three-dimensional matching point pairs;
step S500, based on each two-dimensional-three-dimensional matching point pair, calculating the 6-degree-of-freedom camera pose corresponding to the query image through a PnP-RANSAC framework;
the generative adversarial network is optimized during training with a bidirectional reconstruction loss, a cycle-consistency loss and an adversarial loss; the bidirectional reconstruction loss comprises an L1 loss and an MS-SSIM loss.
In order to describe the unified scene visual positioning method based on the generative adversarial network more clearly, the steps of an embodiment of the method are described in detail below with reference to the drawings.
In the following preferred embodiment, the training process of the generative adversarial network is detailed first, and then the acquisition of the 6-degree-of-freedom camera pose by the unified scene visual positioning method based on the generative adversarial network is detailed.
1. Training process of the generative adversarial network
Step A100, obtaining a training sample set
In the present embodiment, the training sample images are taken from the CMU-Seasons dataset. This dataset is a subset of the CMU Visual Localization dataset established by Badino et al. It covers urban, suburban and park scenes of the Pittsburgh area of the United States. The images were captured by two front-facing cameras mounted on a car, pointed about 45 degrees to the left/right front of the vehicle. The collected images span one year; one collection is taken as the database (image library), and collections acquired under other seasonal conditions are used for queries. The park-scene data of the dataset includes images of the same places under different weather and seasons. It is therefore a suitable training dataset for the generative adversarial network of this embodiment of the invention, i.e., the training dataset includes query images and database images. The generative adversarial network used in the invention is preferably a UniGAN network.
Step A200, preprocessing the query image and the database image
In this embodiment, the selected query images and database images are subjected to semantic segmentation, and the resulting label maps are then spliced with the images before segmentation, as shown in fig. 5. The specific steps are as follows:
a semantic segmentation network is constructed based on a convolutional neural network and trained to perform semantic segmentation using two-dimensional matching point pairs between images captured under different scene conditions;
after training is finished, the query images and database images are input into the trained semantic segmentation network, which outputs semantic label maps assigning a semantic label to every pixel of the query images and database images;
the semantic label maps corresponding to the query images and database images are used as a fourth channel and spliced with the three RGB channels of the corresponding query image or database image, obtaining four-channel query images and database images combined with the semantic labels, so that the semantic information can assist the image translation; a minimal sketch of this stacking step is given below. The other parts of fig. 5 are described later.
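This four-channel stacking can be sketched as follows, assuming the image and label map are held as NumPy arrays; the data layout is an illustrative assumption, not taken from the patent:

```python
import numpy as np

def make_four_channel(rgb: np.ndarray, label_map: np.ndarray) -> np.ndarray:
    """Stack a semantic label map as a fourth channel onto an RGB image.

    rgb       -- H x W x 3 array (uint8 or float)
    label_map -- H x W array of per-pixel semantic label ids
    returns   -- H x W x 4 array: [R, G, B, semantic label]
    """
    if rgb.shape[:2] != label_map.shape[:2]:
        raise ValueError("image and label map must have the same spatial size")
    labels = label_map.astype(rgb.dtype)[..., np.newaxis]   # H x W x 1
    return np.concatenate([rgb, labels], axis=-1)            # H x W x 4
```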
Step A300, training the generative adversarial network based on the four-channel query images and database images.
In this embodiment, a generative adversarial network framework based on an auto-encoder structure is adopted. The scenes of the four-channel query images and database images are defined as two domains $(X_A, X_B)$, an overall loss function is set, and the UniGAN network is trained to translate between the two domains (i.e., to translate the query images and database images into a unified scene); by training on images captured over a long time span, the UniGAN network becomes a cross-domain image translation model. The specific steps are as follows:
The auto-encoder structure is based on the assumption that an image $x$ can be encoded in a latent space into a content code $c$ and a style code $s$, i.e., the image is decomposed into a content code and a style code. The basic framework is as follows:
each domain $X_i$ ($i = A, B$) is provided with an encoder $E_i$ and a decoder $G_i$ such that $(c_i, s_i) = E_i(x_i)$ and $G_i(c_i, s_i) = x_i$. Under this architecture, image translation can be achieved by swapping encoder-decoder pairs.
As shown in FIG. 3, $x_A$ (a query image) is encoded by the auto-encoder into a content code $c_A$ and a style code $s_A$, and $x_B$ (a database image) is encoded into a content code $c_B$ and a style code $s_B$. The style codes and content codes of the two domains are recombined and then decoded, yielding the translated images $x_{AB}$ and $x_{BA}$. Based on the translated images, the loss is obtained through the discriminators, and the network parameters are updated and optimized, as sketched below.
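The following PyTorch-style sketch illustrates this encode, swap and decode step. The module arguments (enc_c, enc_s, dec) and function names are illustrative assumptions for exposition, not the patent's actual UniGAN implementation.

```python
import torch
import torch.nn as nn

class DomainAutoEncoder(nn.Module):
    """Wraps a content encoder, a style encoder and a decoder for one domain."""
    def __init__(self, enc_c: nn.Module, enc_s: nn.Module, dec: nn.Module):
        super().__init__()
        self.enc_c, self.enc_s, self.dec = enc_c, enc_s, dec

    def encode(self, x: torch.Tensor):
        return self.enc_c(x), self.enc_s(x)       # content code, style code

    def decode(self, c: torch.Tensor, s: torch.Tensor):
        return self.dec(c, s)                     # image from (content, style)

def cross_translate(ae_A: DomainAutoEncoder, ae_B: DomainAutoEncoder,
                    x_A: torch.Tensor, x_B: torch.Tensor):
    """Encode 4-channel images from both domains, swap style codes, decode."""
    c_A, s_A = ae_A.encode(x_A)
    c_B, s_B = ae_B.encode(x_B)
    x_AB = ae_B.decode(c_A, s_B)   # query content rendered in the database-domain style
    x_BA = ae_A.decode(c_B, s_A)   # database content rendered in the query-domain style
    return x_AB, x_BA, (c_A, s_A), (c_B, s_B)
```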
The loss of the generative adversarial network, i.e. the total loss, comprises the bidirectional reconstruction loss, the cycle-consistency loss and the adversarial loss; the bidirectional reconstruction loss combines an L1 loss and an MS-SSIM loss.
The bidirectional reconstruction loss means that the model should be able to reconstruct an image $x_i$ in the direction image → latent code → image, and to reconstruct its latent-space code $(c_i, s_i)$ in the direction latent code → image → latent code. The bidirectional reconstruction loss is constructed as shown in equations (1), (2) and (3):

$$\mathcal{L}_{recon}^{x_A}=\mathbb{E}_{x_A\sim p(x_A)}\Big[(1-\alpha)\,\big\|G_A\big(E_A^c(x_A),E_A^s(x_A)\big)-x_A\big\|_1+\alpha\,\mathcal{L}_{MS\text{-}SSIM}\big(G_A\big(E_A^c(x_A),E_A^s(x_A)\big),\,x_A\big)\Big]\tag{1}$$

$$\mathcal{L}_{recon}^{c_A}=\mathbb{E}_{c_A\sim p(c_A),\,s_B\sim p(s_B)}\Big[\big\|E_B^c\big(G_B(c_A,s_B)\big)-c_A\big\|_1\Big]\tag{2}$$

$$\mathcal{L}_{recon}^{s_A}=\mathbb{E}_{c_B\sim p(c_B),\,s_A\sim p(s_A)}\Big[\big\|E_A^s\big(G_A(c_B,s_A)\big)-s_A\big\|_1\Big]\tag{3}$$

wherein $\mathcal{L}_{recon}^{x_A}$ denotes the reconstruction loss of the image $x_A$ in the $X_A$ domain relative to the original image after an encoding-decoding operation, and $\alpha$ is the proportion of the MS-SSIM loss within $\mathcal{L}_{recon}^{x_A}$; $G_A$ denotes the decoder of the $X_A$ domain and $G_B$ the decoder of the $X_B$ domain; $E_A^c$ denotes the content encoder and $E_A^s$ the style encoder of the $X_A$ domain, and $E_B^c$ denotes the content encoder of the $X_B$ domain; $\mathcal{L}_{recon}^{c_A}$ denotes the reconstruction loss of the content code $c_A$ in the $X_A$ domain relative to the original content code after a decoding-encoding operation, and $\mathcal{L}_{recon}^{s_A}$ denotes the reconstruction loss of the style code $s_A$ in the $X_A$ domain relative to the original style code after a decoding-encoding operation.
The other bidirectional reconstruction loss terms, namely the reconstruction loss $\mathcal{L}_{recon}^{x_B}$ of the image $x_B$ in the $X_B$ domain relative to the original image after an encoding-decoding operation, the reconstruction loss $\mathcal{L}_{recon}^{c_B}$ of the content code $c_B$ in the $X_B$ domain relative to the original content code after a decoding-encoding operation, and the reconstruction loss $\mathcal{L}_{recon}^{s_B}$ of the style code $s_B$ in the $X_B$ domain relative to the original style code after a decoding-encoding operation, are constructed in a similar manner.
The image bidirectional reconstruction loss term combines the L1 loss and the MS-SSIM loss so as to retain the color, brightness and contrast of high-frequency regions. For details of the MS-SSIM loss, see: "Zhao H, Gallo O, Frosio I, Kautz J. Loss functions for image restoration with neural networks. IEEE Transactions on Computational Imaging, 2016, 3(1): 47-57"; it is not described in detail here.
For convenience, the mixed reconstruction loss (or difference loss) between two images $m$ and $n$ is written below using the following function, as shown in equation (4):

$$R_{Mix}(m,n)=(1-\alpha)\,\|m-n\|_1+\alpha\,\mathcal{L}_{MS\text{-}SSIM}(m,n)\tag{4}$$
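A minimal PyTorch sketch of this mixed reconstruction loss is given below; it assumes an external MS-SSIM implementation (e.g. the pytorch-msssim package) is passed in as a callable, and the mixing weight alpha is a hyperparameter that the patent does not fix.

```python
import torch
import torch.nn.functional as F

def r_mix(m: torch.Tensor, n: torch.Tensor, ms_ssim_fn, alpha: float = 0.84) -> torch.Tensor:
    """Mixed reconstruction loss of equation (4): (1 - alpha) * L1 + alpha * MS-SSIM loss.

    m, n        -- image batches of shape (B, C, H, W), values assumed in [0, 1]
    ms_ssim_fn  -- callable returning the MS-SSIM similarity in [0, 1]
                   (e.g. functools.partial(pytorch_msssim.ms_ssim, data_range=1.0))
    alpha       -- proportion of the MS-SSIM term; 0.84 follows Zhao et al. (2016),
                   the patent does not state the value used.
    """
    l1 = F.l1_loss(m, n)
    ms_ssim_loss = 1.0 - ms_ssim_fn(m, n)   # turn the similarity into a loss
    return (1.0 - alpha) * l1 + alpha * ms_ssim_loss
```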
the loss of cycle consistency means that the translated image generated by the model should have no resolvability from the real image in the target domain. The calculation is shown in equation (5):
Lcyc=RMix(GB→A(GA→B(xA)),xA)+RMix(GA→B(GB→A(xB)),xB) (5)
Lcycrepresents that X isA、XBRespectively translating the images in the domain into a counterpart domain, and translating the images back to the original domain to translate the reconstruction loss value of the images relative to the original images; gA→BRepresenting the image by XADomain translation to XBGenerator of domain, with GB(cA,sB) Equivalent, GBRepresents XBA decoder of the domain. The image is divided into XBDomain translation to Domain XAGenerator G ofB→ADefined in a similar manner.
The adversarial loss means that, within the generative adversarial network framework, the capabilities of the generator and the discriminator are gradually strengthened during training until a steady state is finally reached, so that the translated images produced by the generator become indistinguishable from real images in the target domain. It is constructed as shown in equation (6):

$$\mathcal{L}_{GAN}^{x_B}=\mathbb{E}_{c_A\sim p(c_A),\,s_B\sim p(s_B)}\big[\log\big(1-D_B(G_B(c_A,s_B))\big)\big]+\mathbb{E}_{x_B\sim p(x_B)}\big[\log D_B(x_B)\big]\tag{6}$$

wherein $\mathcal{L}_{GAN}^{x_B}$ denotes the adversarial loss value of the network in the $X_B$ domain, and $D_B$ is the discriminator that tries to distinguish translated images from real images in $X_B$. The discriminator $D_A$ and the adversarial loss term $\mathcal{L}_{GAN}^{x_A}$ of the network in the $X_A$ domain are defined in a similar manner.
The total loss of the generative adversarial network is obtained as the weighted sum of the bidirectional reconstruction loss, the cycle-consistency loss and the adversarial loss, as shown in equation (7):

$$\min_{E_A,E_B,G_A,G_B}\ \max_{D_A,D_B}\ \mathcal{L}(E_A,E_B,G_A,G_B,D_A,D_B)=\mathcal{L}_{GAN}^{x_A}+\mathcal{L}_{GAN}^{x_B}+\lambda_{cyc}\mathcal{L}_{cyc}+\lambda_x\big(\mathcal{L}_{recon}^{x_A}+\mathcal{L}_{recon}^{x_B}\big)+\lambda_c\big(\mathcal{L}_{recon}^{c_A}+\mathcal{L}_{recon}^{c_B}\big)+\lambda_s\big(\mathcal{L}_{recon}^{s_A}+\mathcal{L}_{recon}^{s_B}\big)\tag{7}$$

wherein $\lambda_{cyc}$, $\lambda_x$, $\lambda_c$ and $\lambda_s$ are the weights of the corresponding loss terms.
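A compact sketch of how these terms could be aggregated during training is shown below; the individual loss tensors are assumed to have been computed as in equations (1)-(6), and the weight values are placeholders, since the patent does not disclose them.

```python
from dataclasses import dataclass
import torch

@dataclass
class LossWeights:
    # Placeholder weights; the patent does not state the values used.
    lam_cyc: float = 10.0
    lam_x: float = 10.0
    lam_c: float = 1.0
    lam_s: float = 1.0

def total_loss(gan_A: torch.Tensor, gan_B: torch.Tensor,
               cyc: torch.Tensor,
               rec_xA: torch.Tensor, rec_xB: torch.Tensor,
               rec_cA: torch.Tensor, rec_cB: torch.Tensor,
               rec_sA: torch.Tensor, rec_sB: torch.Tensor,
               w: LossWeights = LossWeights()) -> torch.Tensor:
    """Weighted sum of equation (7): adversarial + cycle + bidirectional reconstruction."""
    return (gan_A + gan_B
            + w.lam_cyc * cyc
            + w.lam_x * (rec_xA + rec_xB)
            + w.lam_c * (rec_cA + rec_cB)
            + w.lam_s * (rec_sA + rec_sB))
```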
Fig. 4 compares the appearance of some images before and after translation. From the left column to the right column: the original image before translation (Origin), the semantic map obtained after semantic segmentation with the labels mapped to their corresponding colors (Semantics), and the unified-scene image after UniGAN translation (Translated). It can be seen that the translated images largely eliminate the differences brought to the scene by environmental factors such as season and illumination, so that images captured under different scene conditions can first be unified into essentially the same standard scene before the subsequent positioning steps are executed, thereby improving robustness to scene changes.
2. Unified scene visual positioning method based on the generative adversarial network
Step S100, acquiring a query image, and performing semantic segmentation on the query image through a semantic segmentation network to obtain a semantic tag map.
In this embodiment, the query image to be positioned in the current scene is acquired, and semantic segmentation is performed on it through the trained semantic segmentation network to obtain a semantic label map.
Step S200, splicing the semantic label map and the query image, translating the result through the generator of the pre-trained generative adversarial network, and taking the translated image as a first image.
In this embodiment, the semantic label map is used as a fourth channel and spliced with the three RGB channels of the query image to obtain a four-channel query image, which is translated by the generator of the pre-trained generative adversarial network; the translated image is taken as the first image.
Step S300, extracting a global descriptor and two-dimensional local features of the first image; respectively matching the global descriptor of the first image with the global descriptors of all images in a preset image library to obtain candidate images; the image library is a database stored after semantic segmentation and generator translation of the image of the scene corresponding to the query image.
In this embodiment, the translated image, i.e. the first image, is searched and matched in a pre-constructed image library to determine candidate images. The method comprises the following specific steps:
extracting the global descriptor of the translated query image using NetVLAD; taking the L2 distance between global descriptors as the criterion, calculating and sorting the distances between the descriptor of the translated query image and those of the images in the pre-constructed image library, and taking the images of the image library corresponding to the N smallest L2 distances as candidate images; that is, N candidate images are obtained from the pre-constructed image library for the translated image, where N is a positive integer (N is preferably set to 10 in the invention). The image library is the database mentioned above, i.e., a database stored after semantic segmentation and generator translation of the images (database images) of the scene corresponding to the query image. For NetVLAD, see: "Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297-5307"; it is not described in detail here.
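The retrieval step can be sketched as follows: a simple brute-force nearest-neighbour search over precomputed global descriptors. The descriptor storage format is an assumption made for illustration.

```python
import numpy as np

def retrieve_candidates(query_desc: np.ndarray,
                        db_descs: np.ndarray,
                        n: int = 10) -> np.ndarray:
    """Return the indices of the n database images closest to the query.

    query_desc -- 1D global descriptor of the translated query image (e.g. NetVLAD)
    db_descs   -- 2D array, one global descriptor per row for each translated
                  database image in the image library
    n          -- number of candidate images to keep (the patent prefers 10)
    """
    dists = np.linalg.norm(db_descs - query_desc[np.newaxis, :], axis=1)  # L2 distances
    return np.argsort(dists)[:n]                                          # smallest first
```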
In addition, feature point detection and descriptor extraction are performed on the translated image (the first image) with SuperPoint, i.e., the two-dimensional local features are extracted. For SuperPoint, see: "Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 224-236"; it is not described in detail here.
S400, acquiring a pre-constructed three-dimensional model corresponding to the candidate image; and matching the two-dimensional local features with the three-dimensional point cloud in the candidate image determination range in the three-dimensional model to obtain two-dimensional-three-dimensional matching point pairs.
After the generative adversarial network has been trained, a three-dimensional model of the current scene is constructed from the translated images produced by the network generator for the database images. The specific construction method is as follows:
extracting the two-dimensional local features of each image in the image library through SuperPoint;
based on the extracted two-dimensional local features, performing camera pose calibration through structure-from-motion (SfM), generating a sparse point cloud (i.e., the SfM point cloud), and performing sparse reconstruction to obtain a sparse three-dimensional model.
In this embodiment, the pre-constructed three-dimensional model of the scene corresponding to the N candidate images is obtained, and in this model the three-dimensional point cloud within the range determined by the candidate images is matched with the two-dimensional local features to obtain two-dimensional-three-dimensional matching point pairs. Specifically, similar candidate images are merged into the same 'place' according to the three-dimensional poses corresponding to the N candidate images, and the correspondence between the two-dimensional points of the query image and the three-dimensional points of the point cloud is obtained by matching the two-dimensional feature points associated with the candidate 'places'.
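A simplified sketch of how 2D-3D correspondences could be established is shown below. It assumes each retained 3D point stores one representative SuperPoint descriptor and uses mutual nearest-neighbour matching, which is one common choice for illustration rather than the patent's stated procedure.

```python
import numpy as np

def match_2d_3d(query_descs: np.ndarray, query_kpts: np.ndarray,
                point_descs: np.ndarray, points_xyz: np.ndarray):
    """Mutual nearest-neighbour matching between query keypoints and 3D points.

    query_descs -- N x D descriptors of the query's 2D keypoints
    query_kpts  -- N x 2 pixel coordinates of those keypoints
    point_descs -- M x D representative descriptors of the candidate 3D points
    points_xyz  -- M x 3 coordinates of the 3D points
    returns     -- (K x 2 image points, K x 3 object points) for PnP
    """
    # Pairwise L2 distances between 2D keypoint descriptors and 3D point descriptors.
    d = np.linalg.norm(query_descs[:, None, :] - point_descs[None, :, :], axis=2)
    nn_2d_to_3d = d.argmin(axis=1)          # best 3D point for each 2D keypoint
    nn_3d_to_2d = d.argmin(axis=0)          # best 2D keypoint for each 3D point
    keep = [i for i, j in enumerate(nn_2d_to_3d) if nn_3d_to_2d[j] == i]  # mutual NN
    idx3d = nn_2d_to_3d[keep]
    return query_kpts[keep], points_xyz[idx3d]
```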
Step S500, calculating the 6-degree-of-freedom camera pose corresponding to the query image through a PnP-RANSAC framework based on each two-dimensional-three-dimensional matching point pair.
In this embodiment, based on the two-dimensional-three-dimensional matching point pairs, the 6-degree-of-freedom camera pose corresponding to the query image is computed with a PnP-RANSAC framework (the solvePnPRansac() function provided by OpenCV is preferably used in the invention). A 6-degree-of-freedom (DoF) pose comprises the (x, y, z) coordinates and the yaw, pitch and roll angles about the three coordinate axes.
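A minimal sketch of this step with OpenCV is given below; the camera intrinsics matrix, distortion coefficients and RANSAC parameters are placeholders that would come from calibration and tuning, not values stated in the patent.

```python
import cv2
import numpy as np

def estimate_pose(image_pts: np.ndarray, object_pts: np.ndarray,
                  K: np.ndarray, dist=None):
    """Estimate the 6-DoF camera pose from 2D-3D correspondences with PnP-RANSAC.

    image_pts  -- K x 2 pixel coordinates in the query image
    object_pts -- K x 3 corresponding 3D points from the SfM model
    K          -- 3 x 3 camera intrinsics matrix (from calibration)
    dist       -- distortion coefficients, or None for an undistorted camera
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_pts.astype(np.float64),
        image_pts.astype(np.float64),
        K, dist,
        reprojectionError=8.0,   # illustrative threshold, not from the patent
        iterationsCount=1000,
        flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP-RANSAC failed to find a pose")
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> rotation matrix
    return R, tvec, inliers
```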
In addition, to demonstrate the effectiveness of the unified scene visual positioning method based on the generative adversarial network, the invention was evaluated on the CMU-Seasons dataset. As described above, this dataset contains urban, suburban and park scenes of the Pittsburgh area of the United States, with images spanning one year, and is therefore a suitable test bed for the problem studied by the invention. In the evaluation, the park scenes were used to train the UniGAN, and the entire dataset was used for the positioning accuracy evaluation. The evaluation results are shown in Table 1:
TABLE 1
[Table 1: recall of the three compared positioning methods under different distance and angle thresholds, per scene and foliage condition]
In Table 1, Foliage denotes a scene with foliage, Mixed Foliage a scene with mixed foliage, No Foliage a scene without foliage, Urban an urban scene, Suburban a suburban scene and Park a park scene; distance denotes the distance error threshold in meters, and orientation denotes the angle error threshold in degrees.
Table 1 shows the recall [%] of three different positioning methods under different distance and angle thresholds. The first method performs no scene unification: NetVLAD (NV) is used directly for global image retrieval and SuperPoint (SP) for local feature matching, realizing coarse-to-fine hierarchical localization. The second method unifies the scenes of the original three-channel (RGB) query and database images with UniGAN and then performs hierarchical localization. The third method first performs semantic segmentation on the query and database images, then unifies the scenes of the four-channel (RGB-S) query and database images combined with the semantic label maps using UniGAN, and then performs hierarchical localization.
As can be seen from the data in Table 1, in application environments with a long time span, especially under the high-precision threshold (0.25 m, 2°) and in scenes strongly affected by environmental factors, the positioning results of the scene-unified methods are generally superior to those of the first method, and the third method, which incorporates semantic information, is generally superior to the second.
A unified scene visual positioning system based on a generative adversarial network according to a second embodiment of the present invention, as shown in fig. 2, comprises: a semantic segmentation module 100, a translation module 200, a descriptor matching module 300, a point pair acquisition module 400 and a camera pose calculation module 500;
the semantic segmentation module 100 is configured to obtain a query image, and perform semantic segmentation on the query image through a semantic segmentation network to obtain a semantic tag map;
the translation module 200 is configured to splice the semantic tag map and the query image, translate the result with the generator of a pre-trained generative adversarial network, and take the translated image as a first image;
the descriptor matching module 300 is configured to extract a global descriptor and two-dimensional local features of the first image; respectively matching the global descriptor of the first image with the global descriptors of all images in a preset image library to obtain candidate images; the image library is a database stored after semantic segmentation and generator translation of the image of the scene corresponding to the query image;
the point pair obtaining module 400 is configured to obtain a pre-constructed three-dimensional model corresponding to the candidate image; matching the two-dimensional local features with three-dimensional point clouds in a candidate image determination range in the three-dimensional model to obtain two-dimensional-three-dimensional matching point pairs;
the camera pose calculation module 500 is configured to calculate the 6-degree-of-freedom camera pose corresponding to the query image through a PnP-RANSAC framework based on each two-dimensional-three-dimensional matching point pair;
the generative adversarial network is optimized during training with a bidirectional reconstruction loss, a cycle-consistency loss and an adversarial loss; the bidirectional reconstruction loss comprises an L1 loss and an MS-SSIM loss.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that the unified scene visual positioning system based on the generative adversarial network provided in the above embodiment is only illustrated by the division of the above functional modules; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the modules or steps in the embodiment of the present invention may be further decomposed or combined. For example, the modules of the above embodiment may be combined into one module, or further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores a plurality of programs, the programs being adapted to be loaded by a processor to implement the above unified scene visual positioning method based on a generative adversarial network.
A processing apparatus according to a fourth embodiment of the present invention includes a processor and a storage device; the processor is adapted to execute programs; the storage device is adapted to store a plurality of programs; and the programs are adapted to be loaded and executed by the processor to implement the above unified scene visual positioning method based on a generative adversarial network.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and are not described herein again.
Referring now to FIG. 6, there is illustrated a block diagram of a computer system suitable for use as a server in implementing embodiments of the method, system, and apparatus of the present application. The server in fig. 6 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system includes a Central Processing Unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for system operation are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An Input/Output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output section 607 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN (Local area network) card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601. It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (9)

1. A unified scene visual positioning method based on a generative adversarial network, characterized by comprising the following steps:
step S100, acquiring a query image, and performing semantic segmentation on the query image through a semantic segmentation network to obtain a semantic tag map;
step S200, splicing the semantic label map and the query image, translating the result through a generator of a pre-trained generative adversarial network, and taking the translated image as a first image;
step S300, extracting a global descriptor and two-dimensional local features of the first image; respectively matching the global descriptor of the first image with the global descriptors of all images in a preset image library to obtain candidate images; the image library is a database stored after semantic segmentation and generator translation of the images of the scene corresponding to the query image;
step S400, acquiring a pre-constructed three-dimensional model corresponding to the candidate images; matching the two-dimensional local features with the three-dimensional point cloud within the range determined by the candidate images in the three-dimensional model to obtain two-dimensional-three-dimensional matching point pairs;
step S500, based on each two-dimensional-three-dimensional matching point pair, calculating the 6-degree-of-freedom camera pose corresponding to the query image through a PnP-RANSAC framework;
wherein the generative adversarial network is optimized during training with a bidirectional reconstruction loss, a cycle-consistency loss and an adversarial loss, and the bidirectional reconstruction loss comprises an L1 loss and an MS-SSIM loss.
2. The unified scene visual positioning method based on a generative adversarial network as claimed in claim 1, wherein the generative adversarial network is trained as follows:
acquiring a training sample set; the training sample set comprises query images and database images; the database images are images of the scene corresponding to the query images;
performing semantic segmentation on the query image and the database image respectively through a semantic segmentation network, and splicing the resulting label maps with the corresponding images before segmentation; taking the spliced query image as a second image and the spliced database image as a third image;
decomposing the second image and the third image into content codes and style codes respectively with the generator of the generative adversarial network;
recombining the style codes and content codes of the second image and the third image and then decoding; and, based on the decoded images, obtaining the corresponding loss values through the discriminators of the generative adversarial network and updating the network parameters.
3. The method for unified scene visual positioning based on generative confrontation network as claimed in claim 2, wherein the calculation method of loss value of the generative confrontation network is:
\min_{E_A,E_B,G_A,G_B}\ \max_{D_A,D_B}\ \mathcal{L}(E_A,E_B,G_A,G_B,D_A,D_B) = \mathcal{L}_{GAN}^{x_A} + \mathcal{L}_{GAN}^{x_B} + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{x}\big(\mathcal{L}_{recon}^{x_A} + \mathcal{L}_{recon}^{x_B}\big) + \lambda_{c}\big(\mathcal{L}_{recon}^{c_A} + \mathcal{L}_{recon}^{c_B}\big) + \lambda_{s}\big(\mathcal{L}_{recon}^{s_A} + \mathcal{L}_{recon}^{s_B}\big)
wherein E_A is the encoder of the X_A domain and E_B is the encoder of the X_B domain; G_A is the decoder of the X_A domain and G_B is the decoder of the X_B domain; D_A is the discriminator that tries to distinguish translated images from real images in the X_A domain, and D_B is the discriminator that tries to distinguish translated images from real images in the X_B domain;
\mathcal{L}_{GAN}^{x_A} denotes the adversarial loss of the network in the X_A domain, and \mathcal{L}_{GAN}^{x_B} denotes the adversarial loss of the network in the X_B domain;
\mathcal{L}_{cyc} denotes the cycle consistency loss: images in the X_A and X_B domains are each translated into the opposite domain and then translated back to the original domain, and the reconstruction loss of the translated-back image relative to the original image is measured;
\mathcal{L}_{recon}^{x_A} denotes the reconstruction loss of an image x_A in the X_A domain relative to the original image after an encoding-decoding operation, and \mathcal{L}_{recon}^{x_B} denotes the corresponding loss for an image x_B in the X_B domain;
\mathcal{L}_{recon}^{c_A} denotes the reconstruction loss of the content code c_A in the X_A domain relative to the original content code after a decoding-encoding operation, and \mathcal{L}_{recon}^{c_B} denotes the corresponding loss for the content code c_B in the X_B domain;
\mathcal{L}_{recon}^{s_A} denotes the reconstruction loss of the style code s_A in the X_A domain relative to the original style code after a decoding-encoding operation, and \mathcal{L}_{recon}^{s_B} denotes the corresponding loss for the style code s_B in the X_B domain;
\lambda_{cyc}, \lambda_{x}, \lambda_{c}, \lambda_{s} denote the weights of the corresponding loss terms; X_A denotes the scene corresponding to the query images in the generator, and X_B denotes the scene corresponding to the database images in the generator.
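As an illustration only, the weighted combination above can be written as a short Python helper; the individual loss tensors are assumed to be computed elsewhere, and the default weights are placeholders rather than values taken from the patent.

def total_loss(l_gan_a, l_gan_b, l_cyc,
               l_rec_xa, l_rec_xb, l_rec_ca, l_rec_cb, l_rec_sa, l_rec_sb,
               lam_cyc=10.0, lam_x=10.0, lam_c=1.0, lam_s=1.0):
    # Weighted sum of adversarial, cycle-consistency and reconstruction terms
    return (l_gan_a + l_gan_b
            + lam_cyc * l_cyc
            + lam_x * (l_rec_xa + l_rec_xb)
            + lam_c * (l_rec_ca + l_rec_cb)
            + lam_s * (l_rec_sa + l_rec_sb))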
4. The method according to claim 1, wherein the two-dimensional local features are extracted by SuperPoint.
5. The method according to claim 4, wherein in step S300, "respectively matching the global descriptor of the first image with the global descriptors of the images in a preset image library to obtain candidate images" is performed by:
extracting the global descriptors of the first image and of the images in the preset image library through a NetVLAD network;
respectively calculating the L2 distance between the global descriptor of the first image and the global descriptor of each image in the preset image library;
and taking the image-library images corresponding to the N smallest L2 distances as the candidate images, wherein N is a positive integer.
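A self-contained numpy sketch of this L2-distance retrieval is shown below; the descriptors are random stand-ins, since the claim assumes they have already been extracted by NetVLAD.

import numpy as np

def retrieve_candidates(query_desc, library_descs, n=10):
    # L2 distance between the query descriptor and every library descriptor
    dists = np.linalg.norm(library_descs - query_desc[None, :], axis=1)
    # indices of the n smallest distances, i.e. the candidate images
    return np.argsort(dists)[:n]

# usage with random stand-in descriptors (4096-D is a commonly used reduced NetVLAD size)
rng = np.random.default_rng(0)
query = rng.standard_normal(4096).astype(np.float32)
library = rng.standard_normal((500, 4096)).astype(np.float32)
print(retrieve_candidates(query, library, n=10))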
6. The unified scene visual positioning method based on a generative adversarial network according to claim 1, wherein the three-dimensional model is constructed by:
extracting the local features of each image in the image library through SuperPoint;
based on the extracted local features, calibrating the camera poses and generating a sparse point cloud through a structure-from-motion (SfM) method;
and constructing the three-dimensional model based on the camera poses and the sparse point cloud.
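Claim 6 relies on a standard structure-from-motion pipeline; the snippet below only illustrates the triangulation step that turns matched 2D features and known camera poses into sparse 3D points, using OpenCV with synthetic data, and is not the full SfM method referenced in the claim.

import numpy as np
import cv2

# Two calibrated views: shared intrinsics K, first camera at the origin,
# second camera translated by 1 unit along the x axis.
K = np.array([[800., 0., 320.],
              [0., 800., 240.],
              [0., 0., 1.]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.], [0.], [0.]])])

# Matched 2D feature locations in both views (stand-ins for SuperPoint matches), 2xN.
pts1 = np.array([[300., 350.], [200., 260.]])
pts2 = np.array([[220., 270.], [200., 260.]])

# Triangulate to homogeneous 3D points, then dehomogenize to obtain sparse cloud points.
X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)   # 4xN
X = X_h[:3] / X_h[3]                              # 3xN Euclidean coordinates
print(X.T)   # each row is one reconstructed 3D point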
7. A unified scene visual positioning system based on a generative adversarial network, the system comprising: a semantic segmentation module, a translation module, a descriptor matching module, a point pair acquisition module and a camera pose calculation module;
the semantic segmentation module is configured to acquire a query image and perform semantic segmentation on the query image through a semantic segmentation network to obtain a semantic label map;
the translation module is configured to splice the semantic label map with the query image, translate the result through a generator of a pre-trained generative adversarial network, and take the translated image as a first image;
the descriptor matching module is configured to extract a global descriptor and two-dimensional local features of the first image, and respectively match the global descriptor of the first image with the global descriptors of the images in a preset image library to obtain candidate images; the image library is a database of images of the scene corresponding to the query image, stored after semantic segmentation and generator translation;
the point pair acquisition module is configured to acquire a pre-constructed three-dimensional model corresponding to the candidate images, and match the two-dimensional local features with the three-dimensional point cloud of the three-dimensional model within the range determined by the candidate images to obtain two-dimensional-three-dimensional matching point pairs;
the camera pose calculation module is configured to calculate the 6-degree-of-freedom camera pose corresponding to the query image through a PnP-RANSAC framework based on the two-dimensional-three-dimensional matching point pairs;
the generative adversarial network is optimized during training with a bidirectional reconstruction loss, a cycle consistency loss and an adversarial loss; the bidirectional reconstruction loss comprises an L1 loss and an MS-SSIM loss.
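Claims 1 and 7 state that the bidirectional reconstruction loss mixes an L1 term with an MS-SSIM term. A minimal PyTorch sketch is given below, assuming the third-party pytorch_msssim package for the MS-SSIM computation; the mixing weight alpha is an illustrative choice, not a value specified in the patent.

import torch.nn.functional as F
from pytorch_msssim import ms_ssim   # third-party package, assumed available

def bidirectional_recon_term(recon, target, alpha=0.84):
    # recon, target: image batches scaled to [0, 1], shape (B, C, H, W)
    l1 = F.l1_loss(recon, target)
    msssim = 1.0 - ms_ssim(recon, target, data_range=1.0)
    # alpha weights the structural (MS-SSIM) part against the L1 part
    return alpha * msssim + (1.0 - alpha) * l1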
8. A storage device having a plurality of programs stored therein, wherein said programs are adapted to be loaded and executed by a processor to implement the unified scene visual positioning method based on a generative adversarial network according to any one of claims 1 to 6.
9. A processing device, comprising a processor and a storage device; the processor being adapted to execute programs; the storage device being adapted to store a plurality of programs; characterized in that said programs are adapted to be loaded and executed by the processor to implement the unified scene visual positioning method based on a generative adversarial network according to any one of claims 1 to 6.
CN202010517260.1A 2020-06-09 2020-06-09 Unified scene visual positioning method based on generative confrontation network Active CN111724443B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010517260.1A CN111724443B (en) 2020-06-09 2020-06-09 Unified scene visual positioning method based on generative confrontation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010517260.1A CN111724443B (en) 2020-06-09 2020-06-09 Unified scene visual positioning method based on generative confrontation network

Publications (2)

Publication Number Publication Date
CN111724443A true CN111724443A (en) 2020-09-29
CN111724443B CN111724443B (en) 2022-11-08

Family

ID=72567640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010517260.1A Active CN111724443B (en) 2020-06-09 2020-06-09 Unified scene visual positioning method based on generative confrontation network

Country Status (1)

Country Link
CN (1) CN111724443B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180373979A1 (en) * 2017-06-22 2018-12-27 Adobe Systems Incorporated Image captioning utilizing semantic text modeling and adversarial learning
US20200143079A1 (en) * 2018-11-07 2020-05-07 Nec Laboratories America, Inc. Privacy-preserving visual recognition via adversarial learning
CN110363215A (en) * 2019-05-31 2019-10-22 中国矿业大学 The method that SAR image based on production confrontation network is converted into optical imagery
CN111046125A (en) * 2019-12-16 2020-04-21 视辰信息科技(上海)有限公司 Visual positioning method, system and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Qi: "Research on Optimization of Simulated Prosthetic Vision Images Based on Generative Adversarial Networks", China Master's Theses Full-text Database *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112798020A (en) * 2020-12-31 2021-05-14 中汽研(天津)汽车工程研究院有限公司 System and method for evaluating positioning accuracy of intelligent automobile
CN113379646A (en) * 2021-07-07 2021-09-10 厦门大学 Algorithm for performing dense point cloud completion by using generated countermeasure network
CN113379646B (en) * 2021-07-07 2022-06-21 厦门大学 Algorithm for performing dense point cloud completion by using generated countermeasure network
CN113313771A (en) * 2021-07-19 2021-08-27 山东捷瑞数字科技股份有限公司 Omnibearing measuring method for industrial complex equipment
CN113313771B (en) * 2021-07-19 2021-10-12 山东捷瑞数字科技股份有限公司 Omnibearing measuring method for industrial complex equipment
CN113570535A (en) * 2021-07-30 2021-10-29 深圳市慧鲤科技有限公司 Visual positioning method and related device and equipment
CN113963188A (en) * 2021-09-16 2022-01-21 杭州易现先进科技有限公司 Method, system, device and medium for visual positioning by combining map information
CN113963188B (en) * 2021-09-16 2024-08-23 杭州易现先进科技有限公司 Method, system, device and medium for combining map information visual positioning
CN114743013A (en) * 2022-03-25 2022-07-12 中国科学院自动化研究所 Local descriptor generation method, device, electronic equipment and computer program product

Also Published As

Publication number Publication date
CN111724443B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
CN111724443B (en) Unified scene visual positioning method based on generative confrontation network
Lin et al. Line segment extraction for large scale unorganized point clouds
Dai et al. RADANet: Road augmented deformable attention network for road extraction from complex high-resolution remote-sensing images
CN110246181B (en) Anchor point-based attitude estimation model training method, attitude estimation method and system
Choi et al. Depth analogy: Data-driven approach for single image depth estimation using gradient samples
Su et al. DLA-Net: Learning dual local attention features for semantic segmentation of large-scale building facade point clouds
CN112990152B (en) Vehicle weight identification method based on key point detection and local feature alignment
Koch et al. Real estate image analysis: A literature review
Li et al. MF-SRCDNet: Multi-feature fusion super-resolution building change detection framework for multi-sensor high-resolution remote sensing imagery
Cheng et al. Hierarchical visual localization for visually impaired people using multimodal images
Liu et al. Registration of infrared and visible light image based on visual saliency and scale invariant feature transform
CN115272599A (en) Three-dimensional semantic map construction method oriented to city information model
CN116844129A (en) Road side target detection method, system and device for multi-mode feature alignment fusion
CN115577768A (en) Semi-supervised model training method and device
US20220155441A1 (en) Lidar localization using optical flow
CN114943747A (en) Image analysis method and device, video editing method and device, and medium
CN117437274A (en) Monocular image depth estimation method and system
CN110135474A (en) A kind of oblique aerial image matching method and system based on deep learning
CN116562234A (en) Multi-source data fusion voice indoor positioning method and related equipment
Jung et al. Progressive modeling of 3D building rooftops from airborne Lidar and imagery
Chen et al. An improved BIM aided indoor localization method via enhancing cross-domain image retrieval based on deep learning
CN114764880A (en) Multi-component GAN reconstructed remote sensing image scene classification method
CN114972937A (en) Feature point detection and descriptor generation method based on deep learning
Han et al. Scene-unified image translation for visual localization
Li et al. AFENet: An Attention-Focused Feature Enhancement Network for the Efficient Semantic Segmentation of Remote Sensing Images.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant