WO2023246537A1 - Navigation, visual positioning and navigation map construction method and electronic device - Google Patents
- Publication number
- WO2023246537A1 (application PCT/CN2023/099610)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- image
- images
- pose
- map
- text
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/003—Navigation within 3D models or images
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/38—Electronic maps specially adapted for navigation; Updating thereof
- G01C21/3804—Creation or updating of map data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/53—Querying
- G06F16/535—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30244—Camera pose
Definitions
- Embodiments of the present application relate to the field of data processing, and in particular to a navigation, visual positioning and navigation map construction method and electronic device.
- AR (Augmented Reality) map
- holographic information display, which allows users to see various information signs and detailed introductions that integrate the virtual and the real
- intelligent search, which allows users to easily find popular check-in spots, the nearest restroom, etc.
- AR visual navigation, which allows users to intuitively see the AR navigation effect in real time
- AR interaction, which allows users to take photos with virtual characters, participate in a variety of virtual activities in the AR world, etc.
- this application provides a navigation, visual positioning and navigation map construction method and electronic device.
- the user can be supported to input multi-modal location search information, and AR visual navigation can then be provided based on the multi-modal location search information that the user inputs in the AR map; in this way, the diversity of navigation search input methods is increased and the user's navigation experience is improved.
- embodiments of the present application provide a navigation method, which includes: first, obtaining the location search information input by the user in the augmented reality (AR) map, where the location search information includes at least one of the following: text, voice, or image; then, based on the location search information, conducting a multi-modal search in preset multi-modal information to determine location search results that match the location search information; and then performing AR visual navigation based on the location search results.
- this application can support users to input location search information in multiple modes.
- the input methods for navigation search are diverse, which can improve the user's navigation experience.
- the location search information may also be text and voice, or text and image, or voice and image, or text, voice, and image, and this application is not limited thereto.
- the AR map can be an application or applet in the device, and this application does not limit this.
- there may be one or more location search results.
- AR visual navigation can be performed based on the location search result selected by the user; alternatively, one location search result can be selected for AR visual navigation according to preset rules (such as the closest distance to the user, the best rating, etc.); this application does not limit this.
- the multi-modal information includes entity identification information of multiple entities, and the entity identification information includes entity identification text and/or an entity identification image; conducting a multi-modal search in the preset multi-modal information based on the location search information to determine the location search results that match the location search information includes: performing feature extraction on the location search information to obtain a first feature corresponding to the location search information; performing feature extraction on the plurality of entity identification information to obtain second features respectively corresponding to the plurality of entity identification information; determining first feature distances between the first feature and the second features respectively corresponding to the plurality of entity identification information; and selecting, from the plurality of entity identification information, the entity identification information whose first feature distance is smaller than a first distance threshold as the location search result.
- cross-modal search can be realized, that is, when the location search information is text or an image, the search is performed over both the entity identification text and the entity identification images contained in the preset multi-modal information to find the entity identification text and/or entity identification image matching the location search information as a location search result.
- when the location search information is voice, the matching entity identification text and/or entity identification images are likewise used as location search results, thereby enabling mutual search between images and text.
- "fuzzy" search can be realized, which can cover more comprehensive entity identification information as much as possible to avoid omissions.
- multi-modal search may also include same-modal search.
- the same-modal search process can be as follows: when the location search information is text, the search is performed over the entity identification text contained in the preset multi-modal information to find the entity identification text that matches the location search information as a location search result.
- when the location search information is voice, voice recognition is first performed to obtain recognition text; the search is then performed over the entity identification text contained in the preset multi-modal information to find the entity identification text that matches the recognition text as a location search result.
- when the location search information is an image, the search is performed over the entity identification images contained in the preset multi-modal information to find the entity identification image that matches the location search information as a location search result.
- the location search information includes at least one of the following: entity name text, entity name voice, or an entity image; entities include places and/or objects contained in the places.
- the AR map can also provide the user with AR visual navigation when the user inputs an object image, object name text or object name voice, so that the user can be quickly guided to the location of the desired entity.
- for example, when the place is a supermarket, the objects contained in the place are the goods sold in the supermarket; in this way, whether the user inputs the name/image of the supermarket or the name/image of goods sold in the supermarket, the AR map can quickly guide the user to the desired supermarket location.
- for example, when the place is a zoo, the objects contained in the place are the animals in the zoo; in this way, whether the user inputs the name/image of the zoo or the name/image of an animal in the zoo, the AR map can quickly guide the user to the desired location.
- determining the first feature distances for the plurality of entity identification information includes: for first entity identification information among the plurality of entity identification information, where the first entity identification information includes both entity identification text and an entity identification image: performing a weighted calculation on (a) the first feature distance between the first feature and the second feature corresponding to the entity identification text contained in the first entity identification information and (b) the first feature distance between the first feature and the second feature corresponding to the entity identification image contained in the first entity identification information; the result of the weighted calculation is used as the first feature distance between the first feature and the second feature corresponding to the first entity identification information.
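The threshold-based search and the weighted distance described above can be sketched as follows. This is only an illustrative sketch: the Euclidean metric, the equal default weights, the normalisation by total weight, and the names `entity_distance` and `search` are assumptions, not details given in the application.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def entity_distance(query_feat, text_feat=None, image_feat=None,
                    w_text=0.5, w_image=0.5):
    """First feature distance for one entity.

    When the entity has both a text feature and an image feature, the two
    distances are combined by a weighted average (weights are assumed values).
    """
    parts = []
    if text_feat is not None:
        parts.append((w_text, euclidean(query_feat, text_feat)))
    if image_feat is not None:
        parts.append((w_image, euclidean(query_feat, image_feat)))
    if not parts:
        raise ValueError("entity has no identification features")
    total_weight = sum(w for w, _ in parts)
    return sum(w * d for w, d in parts) / total_weight


def search(query_feat, entities, threshold):
    """Names of entities whose distance is below the threshold, nearest first."""
    scored = [
        (entity_distance(query_feat, e.get("text"), e.get("image")), name)
        for name, e in entities.items()
    ]
    return [name for dist, name in sorted(scored) if dist < threshold]


# Hypothetical toy features in a shared 2-D embedding space.
entities = {
    "supermarket": {"text": [1.0, 0.0], "image": [0.8, 0.2]},
    "zoo": {"text": [0.0, 1.0]},
}
print(search([1.0, 0.0], entities, threshold=0.5))  # ['supermarket']
```

Because text and image features are compared in one shared space here, the same function serves text, recognised voice, and image queries; a real system would need a feature extractor that actually produces such a shared embedding.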
- performing AR visual navigation based on the location search results includes: generating, based on a pre-generated 2D visual navigation map, a 2D navigation path between the user's current location and the target location corresponding to the location search result; performing visual positioning to determine the target pose, where the target pose refers to the current pose of the device; and performing AR visual navigation based on the 2D navigation path and the target pose.
- the device that collects the first image may be a mobile phone, an AR device, or another device that can collect images, and this application does not limit this.
- performing visual positioning to determine the target pose includes: collecting a first image; extracting the first text in the first image and extracting the first global feature vector of the first image; and performing image retrieval based on the first text, the first global feature vector, the second text in the preset plurality of second images, and the second global feature vectors of the plurality of second images, to select from the plurality of second images a third image matching the first image.
- Multiple second images are collected during the process of constructing the visual positioning map.
- the 2D visual navigation map is generated based on the visual positioning map; the target pose is determined based on the first image and the third image.
- the target pose refers to the pose of the device when the first image is collected.
- this application uses multiple modalities of information such as images and text with high-level semantics in images to perform visual positioning, which can effectively improve the success rate of visual positioning in these scenarios.
- performing image retrieval based on the first text, the first global feature vector, the second text in the preset plurality of second images, and the second global feature vectors of the plurality of second images, to select a third image matching the first image from the plurality of second images, includes: selecting, based on the second text in the plurality of second images, a plurality of fourth images containing the first text from the plurality of second images; determining second feature distances between the first global feature vector and the second global feature vectors of the plurality of fourth images; and selecting, from the plurality of fourth images, a third image whose second feature distance is smaller than a second distance threshold.
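The two-stage retrieval (text filter first, then global-feature distance) can be illustrated with the minimal sketch below. The record layout (`id`, `texts`, `vec`), the Euclidean metric, and the function name `retrieve` are assumptions made for illustration only.

```python
import math


def l2(a, b):
    """Euclidean distance between two global feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(first_text, first_vec, second_images, dist_threshold):
    """Return ids of matching 'third images', nearest first.

    second_images: list of dicts with keys 'id', 'texts' (set of str) and
    'vec' (list of float); this layout is hypothetical.
    """
    # Stage 1: keep only images whose recognised text contains the query text
    # (the "fourth images").
    fourth = [img for img in second_images if first_text in img["texts"]]
    # Stage 2: rank the survivors by global-feature distance and threshold.
    scored = [(l2(first_vec, img["vec"]), img["id"]) for img in fourth]
    return [img_id for dist, img_id in sorted(scored) if dist < dist_threshold]


database = [
    {"id": "a", "texts": {"CAFE"}, "vec": [0.1, 0.9]},
    {"id": "b", "texts": {"CAFE"}, "vec": [0.9, 0.1]},
    {"id": "c", "texts": {"BANK"}, "vec": [0.1, 0.9]},
]
print(retrieve("CAFE", [0.0, 1.0], database, dist_threshold=0.5))  # ['a']
```

Image `c` is visually close to the query but is dropped by the text filter, and `b` shares the text but fails the distance threshold, which is the behaviour the two filtering stages are meant to produce.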
- image retrieval can be implemented, and a third image matching the first image can be retrieved from the plurality of second images (where a second image matching the first image may refer to a second image whose shooting angle and shooting distance are similar to those of the first image).
- the accuracy and retrieval speed of image retrieval can be improved.
- extracting the first global feature vector of the first image includes: determining the target area corresponding to the first text in the first image; increasing the weight corresponding to the target area in a network layer of the trained feature extraction network; and inputting the first image to the feature extraction network to obtain the first global feature vector output by the feature extraction network. In this way, the accuracy of the features corresponding to the first text in the first global feature vector can be increased, thereby improving the accuracy of the third image selected by the second filtering.
- collecting the first image includes: collecting K first images during the rotation of the device, and each first image matches M third images.
- K is an integer greater than 1
- M is a positive integer
- determining the target pose includes: determining, based on the M third images matched by each of the K first images, N candidate poses corresponding to each of the K first images and the single-frame confidences respectively corresponding to the N candidate poses, where N is a positive integer; and determining the target pose according to the N candidate poses corresponding to the K first images and the single-frame confidences corresponding to the N candidate poses.
- determining the target pose based on the N candidate poses corresponding to the K first images and the single-frame confidences corresponding to the N candidate poses includes: traversing the N candidate poses corresponding to the K first images, and selecting one candidate pose from each of the N candidate poses corresponding to any two first images to form a pose combination, so as to obtain multiple pose combinations; for a first pose combination, determining the SLAM (Simultaneous Localization and Mapping) pose between the two first images corresponding to the first pose combination, and the relative pose between the two candidate poses in the first pose combination; determining the pose error between the SLAM pose and the relative pose; if there is a candidate pose combination whose pose error is smaller than the preset error, determining the joint confidence corresponding to the candidate pose combination based on the single-frame confidences corresponding to the candidate pose combination; and determining the candidate pose combination with the highest joint confidence as the target pose.
- SLAM (Simultaneous Localization and Mapping)
- if there is no candidate pose combination with a pose error smaller than the preset error, the candidate pose with the highest single-frame confidence among the N candidate poses corresponding to the K first images is determined as the target pose.
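The multi-frame consistency check can be sketched as below. Several simplifications are assumptions rather than details from the application: poses are planar `(x, y, yaw)` tuples instead of full 6-DoF poses, the pose error is translation error plus absolute yaw error, the joint confidence is the sum of the two single-frame confidences, and the returned target pose is the candidate of the later frame in the winning combination.

```python
import math
from itertools import combinations, product


def wrap(angle):
    """Wrap an angle to (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def relative_pose(a, b):
    """Pose b expressed in the frame of pose a; poses are (x, y, yaw)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    c, s = math.cos(-a[2]), math.sin(-a[2])
    return (c * dx - s * dy, s * dx + c * dy, wrap(b[2] - a[2]))


def pose_error(p, q):
    """Assumed error metric: translation distance plus absolute yaw difference."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) + abs(wrap(p[2] - q[2]))


def select_target(candidates, slam_rel, max_error):
    """candidates[k]: list of (pose, single_frame_confidence) for frame k.
    slam_rel[(i, j)]: SLAM relative pose of frame j in the frame of frame i.
    """
    best = None  # (joint_confidence, pose)
    for i, j in combinations(range(len(candidates)), 2):
        for (pi, ci), (pj, cj) in product(candidates[i], candidates[j]):
            if pose_error(relative_pose(pi, pj), slam_rel[(i, j)]) < max_error:
                joint = ci + cj  # assumed joint-confidence rule
                if best is None or joint > best[0]:
                    best = (joint, pj)
    if best is not None:
        return best[1]
    # Fallback: no consistent combination, so take the single best candidate.
    flat = [pc for frame in candidates for pc in frame]
    return max(flat, key=lambda pc: pc[1])[0]


candidates = [
    [((0.0, 0.0, 0.0), 0.9), ((5.0, 5.0, 0.0), 0.6)],  # frame 0
    [((1.0, 0.0, 0.0), 0.5), ((9.0, 9.0, 0.0), 0.8)],  # frame 1
]
slam_rel = {(0, 1): (1.0, 0.0, 0.0)}  # device moved 1 unit forward
print(select_target(candidates, slam_rel, max_error=0.2))  # (1.0, 0.0, 0.0)
```

In the example, frame 1's highest-confidence candidate `(9, 9, 0)` is rejected because no pairing with it agrees with the SLAM motion, while the lower-confidence `(1, 0, 0)` is consistent with frame 0's best candidate.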
- determining the target pose based on the first image and the third image includes: determining the R first candidate poses with the highest single-frame confidence corresponding to the first image used for the current visual positioning, where R is a positive integer; adding the SLAM pose to each of the R second candidate poses with the highest single-frame confidence corresponding to the first image used in the previous visual positioning, to obtain R third candidate poses; determining the probabilities corresponding to the R third candidate poses and the probabilities corresponding to the R first candidate poses; and determining the first candidate pose or third candidate pose with the highest probability as the target pose.
- multiple candidate poses determined in the last visual positioning are combined to perform this visual positioning to reduce the error of single-frame visual positioning and improve the success rate of visual positioning.
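A rough sketch of combining the previous positioning result with the current one is given below. The application does not state how the probabilities are computed, so the rule used here (confidences normalised over the pooled candidates, with propagated candidates discounted by a `decay` factor) is purely an assumption, as are the planar `(x, y, yaw)` pose representation and the names `compose` and `fuse`.

```python
import math


def compose(pose, motion):
    """Apply a relative motion (dx, dy, dyaw), given in the pose's own frame,
    to a planar pose (x, y, yaw)."""
    x, y, yaw = pose
    dx, dy, dyaw = motion
    c, s = math.cos(yaw), math.sin(yaw)
    return (x + c * dx - s * dy, y + s * dx + c * dy, yaw + dyaw)


def fuse(current, previous, slam_motion, decay=0.8):
    """current / previous: lists of (pose, single_frame_confidence).

    Previous candidates are propagated by the SLAM motion (giving the 'third
    candidate poses'), pooled with the current candidates, and the candidate
    with the highest assumed probability is returned as (pose, probability).
    """
    propagated = [(compose(p, slam_motion), decay * c) for p, c in previous]
    pool = current + propagated
    total = sum(c for _, c in pool)
    return max(((p, c / total) for p, c in pool), key=lambda pc: pc[1])


current = [((2.0, 0.0, 0.0), 0.4)]   # weak fix in the current frame
previous = [((1.0, 0.0, 0.0), 0.9)]  # strong fix in the previous frame
pose, prob = fuse(current, previous, slam_motion=(1.0, 0.0, 0.0))
print(pose)  # (2.0, 0.0, 0.0)
```

Here the propagated candidate wins because the previous fix was much more confident than the current single-frame result.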
- it can improve the success rate of visual positioning in scenes such as lighting changing scenes/seasonal changing scenes/perspective scale changing scenes/repeated texture scenes/weak texture scenes.
- the method further includes: collecting place identification information of the place, where the place identification information includes place identification text and/or place identification graphics; performing a multi-modal search in the plurality of second images to determine a fifth image containing the place identification information, where the plurality of second images are collected in the process of constructing the visual positioning map; determining, based on the place identification information in the fifth image, the 3D (three-dimensional) coordinates of the place in the visual positioning map; and mapping the 3D coordinates of the place in the visual positioning map to the 2D (two-dimensional) visual navigation map to obtain the 2D coordinates of the place in the 2D visual navigation map. In this way, the location of the place can be registered to the visual positioning map and the 2D visual navigation map, thereby providing AR visual navigation when the user navigates with the AR map.
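The step of mapping a place's 3D coordinates to the 2D visual navigation map can be illustrated as follows. The application does not specify the projection; this sketch assumes a z-up map frame, a top-down projection that discards height, and a 2D map described by a hypothetical origin and resolution.

```python
def centroid(points_3d):
    """Mean of the 3D points observed on a place's sign or logo."""
    n = len(points_3d)
    return tuple(sum(p[i] for p in points_3d) / n for i in range(3))


def map_3d_to_2d(point_3d, origin_xy=(0.0, 0.0), metres_per_cell=0.05):
    """Project a 3D map point onto a top-down 2D grid.

    Assumes z is the vertical axis, so it is simply dropped; origin_xy and
    metres_per_cell are hypothetical parameters of the 2D navigation map.
    """
    x, y, _height = point_3d
    u = (x - origin_xy[0]) / metres_per_cell
    v = (y - origin_xy[1]) / metres_per_cell
    return (round(u), round(v))


sign_points = [(12.0, 4.0, 1.5), (13.0, 4.0, 1.9)]
place_3d = centroid(sign_points)  # (12.5, 4.0, 1.7)
print(map_3d_to_2d(place_3d, origin_xy=(10.0, 2.0), metres_per_cell=0.5))  # (5, 4)
```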
- the method further includes: collecting map reconstruction data in the place, and performing three-dimensional reconstruction based on the map reconstruction data to update the visual positioning map, where the map reconstruction data includes a sixth image in the place and the place contains objects of multiple categories; extracting the 2D feature points corresponding to the category identification text in the sixth image, and determining the 3D point cloud of those 2D feature points in the SLAM coordinate system; and mapping the 3D point cloud in the SLAM coordinate system to a 3D point cloud in the coordinate system of the visual positioning map corresponding to the updated visual positioning map. In this way, a visual positioning map within the place can be constructed, to subsequently guide users to quickly reach the locations of various objects in larger places.
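Mapping a point cloud from the SLAM coordinate system into the visual positioning map's coordinate system amounts to applying a coordinate transform. The sketch below assumes a similarity transform (scale, rotation, translation) that is already known; how that transform is estimated is not described in this passage and is outside the sketch.

```python
def transform_points(points, rotation, translation, scale=1.0):
    """Apply p_map = scale * R @ p_slam + t to each 3D point.

    rotation is a 3x3 row-major matrix (list of rows); all parameters are
    assumed to be known from a prior alignment step.
    """
    out = []
    for p in points:
        rotated = [sum(rotation[r][c] * p[c] for c in range(3)) for r in range(3)]
        out.append(tuple(scale * rotated[r] + translation[r] for r in range(3)))
    return out


# Hypothetical alignment: 90 degree rotation about the vertical axis,
# scale factor 2, shift of 1 unit along x.
R = [[0, -1, 0],
     [1, 0, 0],
     [0, 0, 1]]
slam_cloud = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]
print(transform_points(slam_cloud, R, (1.0, 0.0, 0.0), scale=2.0))
# [(1.0, 2.0, 0.0), (-1.0, 0.0, 1.0)]
```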
- a visual positioning map within the place can be constructed; for a small place, there is no need to construct a visual positioning map within the place.
- the method further includes: collecting category identification information of the category, where the category identification information includes category identification text and/or category identification graphics; performing a multi-modal search in the sixth image to determine a seventh image containing the category identification information; determining, based on the category identification information in the seventh image, the 3D coordinates of the objects of the category in the updated visual positioning map; updating the 2D visual navigation map based on the updated visual positioning map; and mapping the 3D coordinates of the objects of the category in the visual positioning map to the updated 2D visual navigation map to obtain the 2D coordinates of the objects of the category in the updated 2D visual navigation map.
- the locations of various types of objects in the place can be registered to the visual positioning map and the 2D visual navigation map, so that when the user navigates with the AR map, AR visual navigation can be provided to guide the user to quickly reach the locations of various types of objects in the larger place.
- embodiments of the present application provide a visual positioning method, which includes: first, collecting a first image; then, extracting the first text in the first image and extracting the first global feature vector of the first image; then, performing image retrieval based on the first text, the first global feature vector, the second text in the preset plurality of second images, and the second global feature vectors of the plurality of second images, to select from the plurality of second images a third image matching the first image, where the plurality of second images are collected in the process of constructing the visual positioning map; and then determining the target pose based on the first image and the third image, where the target pose refers to the pose of the device when the first image was collected.
- performing image retrieval based on the first text, the first global feature vector, the second text in the preset plurality of second images, and the second global feature vectors of the plurality of second images, to select a third image matching the first image from the plurality of second images, includes: selecting, based on the second text in the plurality of second images, a plurality of fourth images containing the first text from the plurality of second images; determining second feature distances between the first global feature vector and the second global feature vectors of the plurality of fourth images; and selecting, from the plurality of fourth images, a third image whose second feature distance is smaller than a second distance threshold.
- image retrieval can be implemented, and a third image matching the first image can be retrieved from the plurality of second images (where a second image matching the first image may refer to a second image whose shooting angle and shooting distance are similar to those of the first image).
- the accuracy and retrieval speed of image retrieval can be improved.
- extracting the first global feature vector of the first image includes: determining the target area corresponding to the first text in the first image; increasing the weight corresponding to the target area in a network layer of the trained feature extraction network; and inputting the first image to the feature extraction network to obtain the first global feature vector output by the feature extraction network. In this way, the accuracy of the features corresponding to the first text in the first global feature vector can be increased, thereby improving the accuracy of the third image selected by the second filtering.
- collecting the first image includes: collecting K first images during the rotation of the device, and each first image matches M third images.
- K is an integer greater than 1
- M is a positive integer
- determining the target pose includes: determining, based on the M third images matched by each of the K first images, N candidate poses corresponding to each of the K first images and the single-frame confidences respectively corresponding to the N candidate poses, where N is a positive integer; and determining the target pose according to the N candidate poses corresponding to the K first images and the single-frame confidences corresponding to the N candidate poses.
- determining the target pose based on the N candidate poses corresponding to the K first images and the single-frame confidences corresponding to the N candidate poses includes: traversing the N candidate poses corresponding to the K first images, and selecting one candidate pose from each of the N candidate poses corresponding to any two first images to form a pose combination, so as to obtain multiple pose combinations; for a first pose combination, determining the simultaneous localization and mapping (SLAM) pose between the two first images corresponding to the first pose combination, and the relative pose between the two candidate poses in the first pose combination; determining the pose error between the SLAM pose and the relative pose; if there is a candidate pose combination whose pose error is smaller than the preset error, determining the joint confidence corresponding to the candidate pose combination based on the single-frame confidences corresponding to the candidate pose combination; and determining the candidate pose combination with the highest joint confidence as the target pose.
- the method further includes: if there is no candidate pose combination with a pose error smaller than the preset error, determining the candidate pose with the highest single-frame confidence among the N candidate poses corresponding to the K first images as the target pose.
- determining the N candidate poses respectively corresponding to the K first images includes: for a target image among the K first images: performing co-view clustering on the M third images matching the target image to obtain N groups of third images; and determining, based on the target image and the N groups of third images, the N candidate poses corresponding to the target image.
- alternatively, determining the N candidate poses respectively corresponding to the K first images includes: for a target image among the K first images: determining, based on the target image and the M third images matching the target image, M candidate poses corresponding to the target image; and performing clustering on the M candidate poses corresponding to the target image to obtain the N candidate poses corresponding to the target image.
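For the variant that clusters M candidate poses into N, a minimal sketch is shown below. The clustering algorithm is not named in this passage; the greedy position-based grouping, the `radius` parameter, the planar `(x, y, yaw)` pose representation, and the choice of the cluster mean as the representative are all assumptions.

```python
import math


def representative(cluster):
    """Mean position and circular-mean yaw of the poses in one cluster."""
    n = len(cluster)
    mean_x = sum(p[0] for p in cluster) / n
    mean_y = sum(p[1] for p in cluster) / n
    mean_yaw = math.atan2(sum(math.sin(p[2]) for p in cluster),
                          sum(math.cos(p[2]) for p in cluster))
    return (mean_x, mean_y, mean_yaw)


def cluster_poses(poses, radius):
    """Greedily group (x, y, yaw) poses whose position lies within `radius`
    of an existing cluster centre; return one representative per cluster."""
    clusters = []
    for pose in poses:
        for cluster in clusters:
            cx, cy, _ = representative(cluster)
            if math.hypot(pose[0] - cx, pose[1] - cy) <= radius:
                cluster.append(pose)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([pose])
    return [representative(c) for c in clusters]


m_poses = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (5.0, 5.0, 1.0)]  # M = 3
n_poses = cluster_poses(m_poses, radius=1.0)                   # N = 2
print(len(n_poses))  # 2
```

The two nearby poses collapse into one candidate near `(0.1, 0.0)`, while the outlier at `(5, 5)` remains a separate candidate.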
- determining the target pose based on the first image and the third image includes: determining the R first candidate poses with the highest single-frame confidence corresponding to the first image used for the current visual positioning, where R is a positive integer; adding the SLAM pose to each of the R second candidate poses with the highest single-frame confidence corresponding to the first image used in the previous visual positioning, to obtain R third candidate poses; determining the probabilities corresponding to the R third candidate poses and the probabilities corresponding to the R first candidate poses; and determining the first candidate pose or third candidate pose with the highest probability as the target pose.
- multiple candidate poses determined in the last visual positioning are combined to perform this visual positioning to reduce the error of single-frame visual positioning and improve the success rate of visual positioning.
- it can improve the success rate of visual positioning in scenes such as lighting changing scenes/seasonal changing scenes/perspective scale changing scenes/repeated texture scenes/weak texture scenes.
- embodiments of the present application provide a navigation map construction method.
- the method includes: collecting place identification information of a place, where the place identification information includes place identification text and/or place identification graphics; Perform multi-modal retrieval in order to determine the second image containing place identification information. Multiple first images are collected in the process of constructing the visual positioning map; according to the place identification information in the second image, determine that the place is in the visual positioning map.
- the 3D coordinates of the place in the visual positioning map are mapped to the 2D visual navigation map to obtain the 2D coordinates of the place in the 2D visual navigation map.
- the 2D visual navigation map is generated based on the visual positioning map. In this way, the location of the place can be registered to the visual positioning image and the 2D visual navigation map, thereby providing AR visual navigation when the user uses the AR map navigation.
- when map reconstruction data within the place is collected, the method further includes: performing three-dimensional reconstruction according to the map reconstruction data to update the visual positioning map, where the map reconstruction data includes a third image within the place, and the place contains multiple categories of objects; extracting the 2D feature points corresponding to the category identification text in the third image, and determining the 3D point cloud of those 2D feature points in the SLAM coordinate system; and mapping the 3D point cloud in the SLAM coordinate system to the coordinate system of the updated visual positioning map, to obtain the corresponding 3D point cloud in the updated visual positioning map.
- in this way, a visual positioning map within the place can be constructed, so that users can subsequently be guided to quickly reach the locations of various objects in larger places.
- the method further includes: obtaining category identification information of the category, where the category identification information includes category identification text and/or category identification graphics; performing multiple processing in the third image.
- in this way, the locations of the various categories of objects in the place can be registered to the visual positioning map and the 2D visual navigation map, so that when the user uses AR map navigation, AR visual navigation can be provided to guide the user to quickly reach the locations of various objects in a larger place.
- first image in the third aspect and any implementation manner of the third aspect, and the second image in the first method and any implementation manner of the third aspect are the same image with different names.
- the second image in the third aspect and any implementation manner of the third aspect, and the fifth image in the first method or any implementation manner of the third aspect are the same image with different names.
- the third image in the third aspect and any implementation manner of the third aspect, and the sixth image in the first method or any implementation manner of the third aspect are the same image with different names.
- the fourth image in the third aspect and any implementation manner of the third aspect, and the seventh image in the first method or any implementation manner of the third aspect are the same image with different names.
- in a fourth aspect, embodiments of the present application provide an electronic device, including: a memory and a processor, where the memory is coupled to the processor; the memory stores program instructions, and when the program instructions are executed by the processor, the electronic device is caused to execute the navigation method in the first aspect or any possible implementation of the first aspect.
- the fourth aspect and any implementation manner of the fourth aspect respectively correspond to the first aspect and any implementation manner of the first aspect.
- for the technical effects corresponding to the fourth aspect and any implementation manner of the fourth aspect, refer to the technical effects corresponding to the first aspect and any implementation manner of the first aspect described above, which will not be repeated here.
- in a fifth aspect, embodiments of the present application provide an electronic device, including: a memory and a processor, where the memory is coupled to the processor; the memory stores program instructions, and when the program instructions are executed by the processor, the electronic device is caused to execute the visual positioning method in the second aspect or any possible implementation of the second aspect.
- the fifth aspect and any implementation manner of the fifth aspect respectively correspond to the second aspect and any implementation manner of the second aspect.
- for the technical effects corresponding to the fifth aspect and any implementation manner of the fifth aspect, refer to the technical effects corresponding to the second aspect and any implementation manner of the second aspect described above, which will not be repeated here.
- in a sixth aspect, embodiments of the present application provide an electronic device, including: a memory and a processor, where the memory is coupled to the processor; the memory stores program instructions, and when the program instructions are executed by the processor, the electronic device is caused to execute the navigation map construction method in the third aspect or any possible implementation of the third aspect.
- the sixth aspect and any implementation manner of the sixth aspect respectively correspond to the third aspect and any implementation manner of the third aspect.
- for the technical effects corresponding to the sixth aspect and any implementation manner of the sixth aspect, refer to the technical effects corresponding to the third aspect and any implementation manner of the third aspect described above, which will not be repeated here.
- in a seventh aspect, embodiments of the present application provide a chip, including one or more interface circuits and one or more processors; the interface circuit is used to receive signals from the memory of the electronic device and send the signals to the processor, where the signals include computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device is caused to perform the navigation method in the first aspect or any possible implementation of the first aspect.
- the seventh aspect and any implementation manner of the seventh aspect respectively correspond to the first aspect and any implementation manner of the first aspect.
- for the technical effects corresponding to the seventh aspect and any implementation manner of the seventh aspect, refer to the technical effects corresponding to the first aspect and any implementation manner of the first aspect described above, which will not be repeated here.
- in an eighth aspect, embodiments of the present application provide a chip, including one or more interface circuits and one or more processors; the interface circuit is used to receive signals from the memory of the electronic device and send the signals to the processor, where the signals include computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device is caused to perform the visual positioning method in the second aspect or any possible implementation of the second aspect.
- the eighth aspect and any implementation manner of the eighth aspect respectively correspond to the second aspect and any implementation manner of the second aspect.
- for the technical effects corresponding to the eighth aspect and any implementation manner of the eighth aspect, refer to the technical effects corresponding to the second aspect and any implementation manner of the second aspect described above, which will not be repeated here.
- in a ninth aspect, embodiments of the present application provide a chip, including one or more interface circuits and one or more processors; the interface circuit is used to receive signals from the memory of the electronic device and send the signals to the processor, where the signals include computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device is caused to execute the navigation map construction method in the third aspect or any possible implementation of the third aspect.
- the ninth aspect and any implementation manner of the ninth aspect respectively correspond to the third aspect and any implementation manner of the third aspect.
- for the technical effects corresponding to the ninth aspect and any implementation manner of the ninth aspect, refer to the technical effects corresponding to the third aspect and any implementation manner of the third aspect described above, which will not be repeated here.
- in a tenth aspect, embodiments of the present application provide a computer-readable storage medium.
- the computer-readable storage medium stores a computer program.
- when the computer program is run on a computer or processor, it causes the computer or processor to execute the navigation method in the first aspect or any possible implementation of the first aspect.
- the tenth aspect and any implementation manner of the tenth aspect respectively correspond to the first aspect and any implementation manner of the first aspect.
- for the technical effects corresponding to the tenth aspect and any implementation manner of the tenth aspect, refer to the technical effects corresponding to the first aspect and any implementation manner of the first aspect described above, which will not be repeated here.
- in an eleventh aspect, embodiments of the present application provide a computer-readable storage medium.
- the computer-readable storage medium stores a computer program.
- when the computer program is run on a computer or processor, it causes the computer or processor to execute the visual positioning method in the second aspect or any possible implementation of the second aspect.
- the eleventh aspect and any implementation manner of the eleventh aspect respectively correspond to the second aspect and any implementation manner of the second aspect.
- for the technical effects corresponding to the eleventh aspect and any implementation manner of the eleventh aspect, refer to the technical effects corresponding to the second aspect and any implementation manner of the second aspect described above, which will not be repeated here.
- in a twelfth aspect, embodiments of the present application provide a computer-readable storage medium.
- the computer-readable storage medium stores a computer program.
- when the computer program is run on a computer or processor, it causes the computer or processor to execute the navigation map construction method in the third aspect or any possible implementation of the third aspect.
- the twelfth aspect and any implementation manner of the twelfth aspect respectively correspond to the third aspect and any implementation manner of the third aspect.
- for the technical effects corresponding to the twelfth aspect and any implementation manner of the twelfth aspect, refer to the technical effects corresponding to the third aspect and any implementation manner of the third aspect described above, which will not be repeated here.
- in a thirteenth aspect, embodiments of the present application provide a computer program product.
- the computer program product includes a software program.
- when the software program is executed by a computer or processor, it causes the computer or processor to execute the navigation method in the first aspect or any possible implementation of the first aspect.
- the thirteenth aspect and any implementation manner of the thirteenth aspect respectively correspond to the first aspect and any implementation manner of the first aspect.
- for the technical effects corresponding to the thirteenth aspect and any implementation manner of the thirteenth aspect, refer to the technical effects corresponding to the first aspect and any implementation manner of the first aspect described above, which will not be repeated here.
- in a fourteenth aspect, embodiments of the present application provide a computer program product.
- the computer program product includes a software program.
- when the software program is executed by a computer or processor, it causes the computer or processor to execute the visual positioning method in the second aspect or any possible implementation of the second aspect.
- the fourteenth aspect and any implementation manner of the fourteenth aspect respectively correspond to the second aspect and any implementation manner of the second aspect.
- for the technical effects corresponding to the fourteenth aspect and any implementation manner of the fourteenth aspect, refer to the technical effects corresponding to the second aspect and any implementation manner of the second aspect described above, which will not be repeated here.
- in a fifteenth aspect, embodiments of the present application provide a computer program product.
- the computer program product includes a software program.
- when the software program is executed by a computer or processor, it causes the computer or processor to execute the navigation map construction method in the third aspect or any possible implementation of the third aspect.
- the fifteenth aspect and any implementation manner of the fifteenth aspect respectively correspond to the third aspect and any implementation manner of the third aspect.
- for the technical effects corresponding to the fifteenth aspect and any implementation manner of the fifteenth aspect, refer to the technical effects corresponding to the third aspect and any implementation manner of the third aspect described above, which will not be repeated here.
- Figure 1 is a schematic diagram of an exemplary application scenario
- Figure 2 is a schematic diagram of an exemplary navigation process
- Figure 3a is an exemplary interface schematic diagram
- Figure 3b is an exemplary interface schematic diagram
- Figure 4 is a schematic diagram of an exemplary navigation process
- Figure 5 is a schematic diagram of an exemplary visual positioning process
- Figure 6 is a schematic diagram of an exemplary visual positioning process
- Figure 7 is a schematic diagram of an exemplary visual positioning process
- Figure 8 is a schematic diagram of an exemplary navigation map construction process
- Figure 9 is a schematic diagram of an exemplary navigation map construction process
- Figure 10 is a schematic diagram of an exemplary navigation map construction process
- Figure 11 is a schematic structural diagram of an exemplary device.
- "A and/or B" can mean three situations: A exists alone, A and B exist simultaneously, or B exists alone.
- the terms "first" and "second" in the description and claims of the embodiments of this application are used to distinguish different objects, rather than to describe a specific order of objects.
- for example, the first target object, the second target object, etc. are used to distinguish different target objects, rather than to describe a specific order of the target objects.
- multiple processing units refer to two or more processing units; multiple systems refer to two or more systems.
- Figure 1 is a schematic diagram of an exemplary application scenario. In the embodiment of Figure 1, a navigation scenario is shown.
- the user can start the AR map (which can be an application or an applet) and enter the navigation search interface 101, as shown in Figure 1(1).
- the navigation search interface 101 may include one or more controls, including but not limited to: an edit box 102 and a modal selection button 103, etc. This application is not limited to this.
- the AR map can display the location search result list 108 in response to the user operation, as shown in Figure 1(3).
- for example, after the user enters "dormitory" in the edit box 102, multiple location search results, such as "B1 Dormitory Building", "B2 Dormitory Building" and "B3 Dormitory Building", are displayed in the location search result list 108, as shown in Figure 1(3).
- the AR map can respond to user operations, start the camera and enter the AR visual navigation interface 109, as shown in Figure 1(4).
- the AR visual navigation interface 109 may include navigation prompt information options 110, navigation guide signs 111, and images collected by the camera (including trees, buildings, and roads in the AR visual navigation interface 109).
- the AR map can continuously update the navigation prompt information in the navigation prompt information option 110 and the navigation guide mark 111 based on the images collected by the camera.
- in addition to supporting users in entering location search information as text, this application also supports entering location search information in other modalities (such as voice and images), which increases the variety of input methods for navigation search and thereby improves the user's navigation experience.
- the AR map can display the modal selection window 104 in response to the user's operation, as shown in Figure 1(2).
- the modal selection window 104 may include multiple modal input options, including but not limited to the voice input option 105, the image input option 106, the photo taking option 107, and so on.
- when the user wants to use voice as location search information, he can click the voice input option 105; the AR map can then start the recording module in response to the user's operation, and the user can perform voice input.
- when the user wants to use a stored image as location search information, he can click the image input option 106; the AR map can then respond to the user's operation and enter the photo album.
- the user can select the image he wants to input from the photo album.
- when the user wants to use a newly captured image as location search information, he can click the photo taking option 107; the AR map can then respond to the user's operation, start the camera and enter the photo taking interface, where the user can obtain the desired input image by taking a photo.
- this application proposes a navigation method that can search based on multi-modal location search information input by the user and provide AR visual navigation for the user.
- Figure 2 is a schematic diagram illustrating an exemplary navigation process.
- the location search information includes at least one of the following: text, voice or image.
- when the user needs navigation, he can start the AR map on the device and enter the navigation search interface 101 in Figure 1(1).
- the user can enter location search information in the form of text in the edit box in the navigation search interface 101; or he can click the modal selection button 103 in Figure 1(1) and enter the modal selection window in Figure 1(2).
- the user can input location search information in any two or three forms of text, voice, and image at the same time on the navigation search interface 101, and this application does not limit this.
- the location search information may refer to information used to search for a location, and may include but is not limited to: entity name text, entity name voice, entity image, address information, etc.
- an entity can refer to something that exists independently and serves as the basis of all attributes and the origin of all things, that is, something that objectively exists and can be distinguished from other things.
- Entities can include places (such as stores, zoos, botanical gardens, etc.) and objects contained in the places (such as goods in stores, animals in zoos, plants in botanical gardens, etc.).
- the entity name text/entity name voice may include place name text/voice, such as the text/voice of "** Store", the text/voice of "** Zoo", the text/voice of "** Botanical Garden", etc.
- the entity name text/entity name voice may include the name text/voice of an object contained in the place, such as the text/voice of "** brand canvas shoes", the text/voice of "Lion", the text/voice of "Redbud", etc.
- the entity image may include a place image, such as a store door number image, a zoo gate image, a botanical garden gate image, etc.
- the entity image may include an object image contained in the place, such as an image of a famous brand of canvas shoes, an image of a lion, an image of a redbud flower, etc.
- the AR map can obtain the location search information input by the user; and then S202 can be performed.
- entity identification information of multiple entities may be collected in advance, and the entity identification information may be associated with corresponding entity addresses.
- the entity identification text of the entity can be collected, the entity identification image of the entity can also be collected, and the entity identification text and entity identification image of the entity can also be collected; this application does not limit this.
- the entity identification text may refer to text that can be used to identify the entity, and the entity identification image may refer to an image containing entity identification text/graphics.
- the entity identification information may include place identification information and object identification information.
- the place identification information may include place identification text and/or place identification image
- the object identification information may include object identification text and/or object identification image.
- the place identification information of any place and the object identification information of the objects contained in that place can also be associated with each other; and both the place identification information and the object identification information can be associated with the address of the place.
- for example, when the place is a large supermarket and the objects contained in the place are goods sold in the supermarket, the object identification information (such as product names and product images) can be associated with the place identification information (such as the supermarket name and the supermarket door number image), and both the object identification information and the place identification information are associated with the supermarket address.
- for another example, when the place is a large zoo and the objects contained in the place are animals in the zoo, the object identification information (such as animal names and animal images) can be associated with the place identification information (such as the zoo name and the zoo door number image), and both the object identification information and the place identification information are associated with the zoo address.
- in this way, the AR map can quickly provide users with accurate navigation.
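- one possible way to hold these associations in memory is sketched below. The record layout, field names, and sample addresses are hypothetical and only illustrate the idea that place identification information and object identification information both resolve to the place's address.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class PlaceRecord:
    name: str                          # place identification text
    address: str                       # entity address used for navigation
    sign_image: Optional[str] = None   # place identification image (e.g. a file path)
    object_names: List[str] = field(default_factory=list)  # object identification text

def build_address_index(places: List[PlaceRecord]) -> Dict[str, List[str]]:
    """Map every place name and object name to the address(es) of the owning place."""
    index: Dict[str, List[str]] = {}
    for place in places:
        index.setdefault(place.name, []).append(place.address)
        for obj in place.object_names:
            index.setdefault(obj, []).append(place.address)
    return index

# Illustrative data only.
places = [
    PlaceRecord("City Zoo", "1 Zoo Road", "zoo_gate.jpg", ["lion", "panda"]),
    PlaceRecord("Big Supermarket", "2 Market Street", None, ["canvas shoes"]),
]
index = build_address_index(places)
```

Storing a list of addresses per key allows the same object name (for example a product sold in several stores) to resolve to more than one place.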
- the collected entity identification information (entity identification text and/or entity identification images) of multiple entities can constitute the multi-modal information.
- a multi-modal search can be performed in the preset multi-modal information to find the location search results that match the location search information.
- multi-modal search may include same-modal search and cross-modal search.
- the same-modal search process can be as follows: when the location search information is text, the entity identification text contained in the preset multi-modal information can be searched to find the entity identification text that matches the location search information, which is used as the location search result.
- when the location search information is voice, voice recognition can first be performed to obtain the recognized text; the entity identification text contained in the preset multi-modal information is then searched to find the entity identification text that matches the recognized text, which is used as the location search result.
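- when the location search information is an image, the entity identification images contained in the preset multi-modal information can be searched to find the entity identification image that matches the location search information, which is used as the location search result.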
- the cross-modal search process can be as follows: when the location search information is text or an image, the entity identification text and entity identification images contained in the preset multi-modal information are searched to find the entity identification text and/or entity identification images that match the location search information, which are used as the location search results.
- when the location search information is voice, voice recognition can first be performed to obtain the recognized text; the entity identification text and entity identification images contained in the preset multi-modal information are then searched to find the entity identification text and/or entity identification images that match the recognized text, which are used as the location search results.
- there may be one or more location search results.
- the location search results may be displayed in the location search result list 108 of FIG. 1(3).
- S203 Perform AR visual navigation based on the location search results.
- when there are multiple location search results, the user can select one location search result as needed; AR visual navigation can then be performed based on the location search result selected by the user.
- when there is only one location search result, AR visual navigation can be performed directly based on the location search result determined in S202.
- Figures 3a and 3b are schematic diagrams of exemplary interfaces.
- for example, if the user wants to buy black headphones, he can enter the location search information "black headphones" in the edit box in the navigation search interface 101 in Figure 1(1), as shown in Figure 3a(1).
- a multi-modal search is then performed in the preset multi-modal information based on "black headphones", and the location search results determined to match the location search information are two headphone images: image 1 and image 2, which are displayed in the location search result list 108 in Figure 1(3), as shown in Figure 3a(2). If the user selects image 1, AR visual navigation can be provided to guide the user to a store selling the headphones in image 1.
- for another example, when the user's mobile phone stores an image of a certain shoe but the user does not know the brand of the shoe, he can click the image input option 106 in the modal selection window 104 in Figure 1(2), and then select the image of this shoe from the displayed album, as shown in Figure 3b(1).
- a multi-modal search is then performed in the preset multi-modal information based on the image of the shoe.
- the location search results determined to match the location search information are three store images: image 3, image 4 and image 5, which are displayed in the location search result list 108 in Figure 1(3), as shown in Figure 3b(2). If the user selects image 5, AR visual navigation can be provided to guide the user to the store corresponding to image 5.
- this application can support users to input location search information in multiple modes.
- the input methods for navigation search are diverse, which can improve the user's navigation experience.
- the location search information may be entity name text, entity name voice, or entity image.
- entity includes a place and the objects contained in the place.
- in addition to providing AR visual navigation based on place name voice or place images, the AR map can also provide AR visual navigation when the user inputs an object image, object name text, or object name voice, so that the user can be quickly guided to the location of the desired entity.
- Figure 4 is a schematic diagram illustrating an exemplary navigation process.
- the cross-modal search process and the AR visual navigation process are specifically described.
- the location search information includes at least one of the following: text, voice, and image.
- S401 may refer to the description of S201 above, which will not be described again here.
- the process of performing a cross-modal search in the preset multi-modal information to determine the location search results matching the location search information may refer to S402 to S405 as follows:
- S402 Perform feature extraction on the location search information to obtain the first feature corresponding to the location search information.
- S403 Perform feature extraction on multiple entity identification information to obtain second features respectively corresponding to the multiple entity identification information.
- S404 Determine the first feature distance between the second feature corresponding to the plurality of entity identification information and the first feature.
- S405 From the plurality of entity identification information, select the entity identification information whose corresponding first feature distance is less than the first distance threshold as the location search result.
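- steps S402 to S405 can be sketched as below. This is a minimal illustration under stated assumptions: the features are taken to be already extracted into a shared feature space (the feature extraction model itself is not shown), Euclidean distance is used as the feature distance although the text does not name a metric, and the function and variable names are hypothetical.

```python
import math

def cross_modal_search(query_feature, entity_features, distance_threshold):
    """Return the ids of entities whose feature distance to the query is below the threshold.

    query_feature: the "first feature" of the location search information.
    entity_features: dict mapping an entity id to its "second feature".
    Results are ordered from the closest match to the farthest.
    """
    distances = {
        entity_id: math.dist(query_feature, feature)   # first feature distance
        for entity_id, feature in entity_features.items()
    }
    hits = [eid for eid, d in distances.items() if d < distance_threshold]
    return sorted(hits, key=distances.get)

# Illustrative feature vectors only.
entity_features = {"store A": [0.0, 0.0], "store B": [3.0, 4.0], "store C": [0.5, 1.0]}
results = cross_modal_search([0.0, 1.0], entity_features, distance_threshold=2.0)
```

A larger threshold returns more, looser matches; a smaller one keeps only entities that are very close to the query in feature space.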
- a cross-modal search model can be pre-trained, and then the trained cross-modal search model can be used to implement cross-modal search.
- the entity name text and entity images of multiple entities can be collected; and the entity name text and entity images of the same entity can be used as a set of training data, so that multiple sets of training data can be obtained.
- the following takes an example of training a cross-modal search model using a set of training data for illustrative explanation.
- the entity name text and entity image of a set of training data can be input into the cross-modal search model.
- the cross-modal search model performs feature extraction on the entity name text to obtain text features corresponding to the entity name text.
- the cross-modal search model extracts features from the entity image to obtain the image features corresponding to the entity image.
- the cross-modal search model can calculate the distance between the text features corresponding to the entity name text and the image features corresponding to the entity image; then, with the goal of minimizing this distance, backpropagation is performed on the cross-modal search model to adjust its model parameters.
- multiple sets of training data can be used to train the cross-modal search model; in this way, the cross-modal search model can learn to unify the image features and text features corresponding to the same entity into the same feature space. Therefore, using the trained cross-modal search model, mutual search between images and text can be achieved.
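- the training objective described above can be written compactly as follows. The notation is an assumption introduced for illustration: $f_\theta$ denotes the text feature extractor, $g_\phi$ the image feature extractor, $(t_i, v_i)$ the entity name text and entity image of the $i$-th set of training data, $N$ the number of sets, and $d$ the feature distance (the text does not specify which distance is used).

```latex
\mathcal{L}(\theta, \phi) = \frac{1}{N} \sum_{i=1}^{N} d\bigl(f_\theta(t_i),\, g_\phi(v_i)\bigr),
\qquad
(\theta^{*}, \phi^{*}) = \arg\min_{\theta,\, \phi} \mathcal{L}(\theta, \phi)
```

Backpropagation adjusts $\theta$ and $\phi$ along the gradient of $\mathcal{L}$, which pulls the text feature and image feature of the same entity toward each other in the shared feature space.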
- each time, one piece of entity identification information and the location search information can be input into the trained cross-modal search model; on the one hand, the cross-modal search model performs feature extraction on the location search information to obtain the first feature corresponding to the location search information; on the other hand, the cross-modal search model performs feature extraction on the entity identification information to obtain the second feature corresponding to the entity identification information. Subsequently, the cross-modal search model can calculate and output the first feature distance between the first feature and the second feature. In this way, the first feature distance between the second feature corresponding to each piece of entity identification information and the first feature can be obtained.
- when the location search information is voice, voice recognition can be performed on the location search information to obtain the recognized text, and the recognized text can then be input into the trained cross-modal search model.
- the entity identification information whose first feature distance between the corresponding second feature and the first feature is less than the first distance threshold can then be determined as the location search result that matches the location search information.
- part of the entity identification information in the multi-modal information only includes entity identification text
- part of the entity identification information only includes entity identification images
- part of the entity identification information includes both entity identification text and entity identification images.
- the entity identification information including the entity identification text and the entity identification image is called first entity identification information.
- the second feature of the entity identification text in the first entity identification information can be obtained.
- the first feature distance between the second feature and the first feature of the entity identification image in the first entity identification information can be obtained.
- a weighted calculation can be performed on the first feature distance between the second feature of the entity identification image in the first entity identification information and the first feature, and the first feature distance between the second feature of the entity identification text in the first entity identification information and the first feature; the weighted calculation result is then used as the first feature distance between the second feature corresponding to the first entity identification information and the first feature.
- the entity identification information whose corresponding first feature distance is less than the first distance threshold can be selected as the location search result from the plurality of entity identification information, so that the entity identification information that is highly similar to the location search information can be used as the location search result.
- the first distance threshold can be set according to requirements, and this application does not limit this.
- fuzzy search can be achieved, which can cover more comprehensive entity identification information as much as possible to avoid omissions.
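- as an illustration of the thresholded search described above, the following is a minimal sketch rather than the implementation of this application. It assumes the cross-modal search model has already produced the text and image distances for each piece of entity identification information; the names `fused_distance`, `search` and `text_weight`, and the equal default weighting, are hypothetical.

```python
from typing import Dict, List, Optional


def fused_distance(text_dist: Optional[float],
                   image_dist: Optional[float],
                   text_weight: float = 0.5) -> float:
    """Return one first feature distance for an entity.

    If the entity has both text and image, the two distances are combined
    by a weighted sum (the weight value is an assumption of this sketch).
    """
    if text_dist is not None and image_dist is not None:
        return text_weight * text_dist + (1.0 - text_weight) * image_dist
    if text_dist is not None:
        return text_dist
    if image_dist is not None:
        return image_dist
    raise ValueError("entity has neither a text nor an image distance")


def search(entities: Dict[str, Dict[str, float]],
           first_distance_threshold: float) -> List[str]:
    """Keep every entity whose distance is below the first distance threshold."""
    results = []
    for name, dists in entities.items():
        dist = fused_distance(dists.get("text"), dists.get("image"))
        if dist < first_distance_threshold:
            results.append(name)
    return results


if __name__ == "__main__":
    candidates = {
        "supermarket": {"text": 0.2, "image": 0.4},  # text and image
        "zoo": {"text": 0.9},                        # text only
        "garden": {"image": 0.1},                    # image only
    }
    print(search(candidates, first_distance_threshold=0.5))
```

Returning every entity under the threshold, rather than only the closest one, is what gives the fuzzy-search behaviour mentioned above.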
- the process of AR visual navigation based on the location search results can follow S406 to S409:
- S406 Based on the pre-generated 2D visual navigation map, generate a 2D navigation path between the user's current location and the target location corresponding to the location search result.
- the target location corresponding to the location search result can be obtained; and then based on the pre-constructed 2D visual navigation map, a 2D navigation path between the user's current location and the target location corresponding to the location search result is generated.
- the generation process of the 2D visual navigation map will be explained later.
- when the location search result is place identification information, the location of the place corresponding to the place identification information can be obtained as the target location (for example, if the location search information is "**Supermarket", then the location corresponding to "**Supermarket" can be used as the target location).
- the location search result is object identification information
- the location of the place to which the object corresponding to the object identification information belongs can be obtained as the target location (for example, if the location search information is "**Potato Chips", then the location of the supermarket selling "**Potato Chips" can be used as the target location).
- the location of the category corresponding to the object identification information can be obtained as the target location (for example, if the location search information is "**Potato Chips", then the location corresponding to "Snack Food" in the supermarket can be used as the target location).
- the user's current location can be obtained by the positioning module in the device.
- S407 Perform visual positioning to determine the target pose.
- the target pose refers to the current pose of the device.
- visual positioning can be performed to determine the current pose of the device; the current pose of the device can be used to characterize the user's current pose, and subsequent AR visual navigation can be performed based on the user's current pose and the 2D navigation path.
- the camera can be started, the camera can be called to collect images, and then visual positioning can be performed based on the collected images to determine the target pose.
- the image collected by the camera during the visual positioning process can be called the first image.
- while activating the camera, the user can also be reminded to point the camera of the device forward, so that the target pose determined by visual positioning based on the first image can be closer to the user's current pose.
- the period for the camera to collect the first image may be preset by the device system (for example, the period is 0.5 ms, that is, the camera collects one frame of image every 0.5 ms), and this application does not limit this.
- visual positioning can be performed according to a preset visual positioning period, where the visual positioning period can be set according to requirements, such as 10s, 15s, etc. This application does not limit this.
- the first image most recently collected by the camera relative to the current moment can be obtained, and visual positioning is then performed based on this first image to determine the pose of the device when it collected the first image (hereinafter called the target pose). The visual positioning process will be explained later.
- S408 Perform AR visual navigation based on the 2D navigation path and target pose.
- the first image collected by the camera can be displayed on the AR visual navigation interface 109 in Figure 1(4), and the navigation instruction information in the navigation prompt information option 110 can be updated according to the 2D navigation path and the target pose.
- the update cycle of the first image in the AR visual navigation interface 109 in Figure 1(4) is the same as the cycle of the camera collecting the first image.
- the navigation instruction information in the navigation prompt information option 110 can be updated; that is, the update period of the navigation instruction information in the navigation prompt information option 110 is the same as the visual positioning period.
- the navigation prompt information in the navigation prompt information option 110 displayed for the first time on the AR visual navigation interface 109 can be generated based on the 2D navigation path generated in S406 and the user's current location obtained by the positioning module. Subsequently, after the first visual positioning is completed, the navigation prompt information in the navigation prompt information option 110 can be updated according to the 2D navigation path and the target pose determined by the first visual positioning.
- This application proposes a visual positioning method that combines multi-modal information for visual positioning to improve the visual positioning success rate in scenes such as illumination changing scenes/seasonal changing scenes/viewing scale changing scenes/repeated texture scenes/weak texture scenes.
- Figure 5 is a schematic diagram illustrating an exemplary visual positioning process. In the embodiment of Figure 5, a visual localization process is described.
- the visual positioning map can be introduced as an example first.
- data collection equipment such as image collection equipment, location collection equipment, etc.
- data collection equipment can be controlled to walk in various places (walking outside the place) to collect data such as images of each place.
- three-dimensional reconstruction is performed based on the collected data to obtain a visual positioning map (the visual positioning map is a 3D map).
- for example, while constructing the visual positioning map, 2D feature points can also be extracted from the images collected by the data collection device (hereinafter referred to as second images), and the 3D point cloud (which can include 3D coordinates) in the visual positioning map corresponding to the 2D feature points can be determined; the GPS (Global Positioning System) information of each second image is also recorded. Then the second image, the 2D feature points contained in the second image, the descriptors of the 2D feature points (vectors used to describe the texture, color and other information of the feature points), the 3D point cloud corresponding to the 2D feature points, the GPS information of the second image, and the index table of the relationship between the second image and the GPS information are stored in the visual positioning map database.
- the second text in the second image can also be extracted; and the global features of the second image can also be extracted to obtain the second global feature vector of the second image. Then, the second text in the second image and the second global feature vector of the second image can be stored in the visual positioning map database.
- map the 3D point cloud in the visual positioning map database to a 2D plane to obtain a 2D visual navigation map.
- the visual positioning process can be described with reference to the following description of steps S501 to S504.
- the camera can be started and the camera can be called to collect the first image.
- S502 Extract the first text in the first image and extract the first global feature vector of the first image.
- the second image that matches the first image can be retrieved from the visual positioning map database through image retrieval (a second image that matches the first image can refer to a second image whose shooting angle and shooting distance are similar to those of the first image); the target pose when the device collects the first image is then determined based on the first image and the second image obtained through image retrieval.
- image retrieval can be performed based on the information with high-order semantics and global features in the first image and the second image.
- a first filtering can be performed on the multiple second images based on the information with high-order semantics, and a second filtering can then be performed on the multiple second images based on global features. In this way, the accuracy and retrieval rate of image retrieval can be improved.
- information with high-level semantics such as parking space number, store door number text
- global feature extraction can be performed on the first image to obtain the first global feature vector of the first image.
- a trained feature extraction network can be used to extract the first global feature vector of the first image.
- some network layers of the feature extraction network each include a weight corresponding to each area in the image; the target area corresponding to the first text in the first image can be determined; then, the weight corresponding to the target area in a network layer (such as the last network layer) of the trained feature extraction network is increased. Then, the first image is input into the feature extraction network, which extracts the global features of the first image and outputs the first global feature vector of the first image.
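- the effect of raising the weight of the text area can be pictured with a spatially weighted pooling step. The following is only a sketch under stated assumptions, not the feature extraction network of this application: it assumes a feature map given as nested lists of shape rows x columns x channels, a text area given as a box of feature-map cells, and a hypothetical `boost` factor for the cells inside that box.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # rows x columns x channels


def weighted_global_feature(feature_map: FeatureMap,
                            text_box: Tuple[int, int, int, int],
                            boost: float = 2.0) -> List[float]:
    """Pool a feature map into one global vector, weighting the text area more.

    text_box is (row_start, row_end, col_start, col_end), end-exclusive.
    Cells inside the box get weight `boost`, all other cells get weight 1.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    r0, r1, c0, c1 = text_box

    pooled = [0.0] * channels
    total_weight = 0.0
    for r in range(rows):
        for c in range(cols):
            weight = boost if (r0 <= r < r1 and c0 <= c < c1) else 1.0
            total_weight += weight
            for k in range(channels):
                pooled[k] += weight * feature_map[r][c][k]

    pooled = [value / total_weight for value in pooled]
    # L2-normalise so that Euclidean distances between vectors are comparable.
    norm = math.sqrt(sum(value * value for value in pooled))
    return [value / norm for value in pooled] if norm > 0 else pooled


if __name__ == "__main__":
    fmap = [[[1.0, 0.0], [0.0, 0.0]],
            [[0.0, 0.0], [0.0, 1.0]]]
    print(weighted_global_feature(fmap, text_box=(0, 1, 0, 1), boost=3.0))
```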
- S503 Perform image retrieval based on the first text, the first global feature vector, the preset second texts in the plurality of second images and the second global feature vectors of the plurality of second images, to select from the plurality of second images a third image that matches the first image; the plurality of second images are collected in the process of constructing the visual positioning map.
- a first filtering can be performed on the multiple second images based on the first text and the second text; a second filtering can then be performed on the multiple second images based on the second global feature vector and the first global feature vector, to filter out the second images that match the first image.
- the second image obtained through image retrieval that matches the first image is called a third image.
- multiple fourth images containing the first text may be selected from the multiple second images based on the second texts in the multiple second images.
- the first text in the first image is the parking space number "0372”
- the second image whose second text includes "0372" can be found as the fourth image.
- the second feature distance between the second global feature vector and the first global feature vector of each fourth image may be calculated respectively.
- the second feature distance may be Euclidean distance. In this way, the second feature distance between the second global feature vector and the first global feature vector of each fourth image can be obtained.
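- the fourth image whose second feature distance is less than the second distance threshold can be taken as the second image that matches the first image (i.e. the third image)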
- the third image may be one or multiple images, and this application does not limit this.
- the second distance threshold can be set according to requirements, and this application does not limit this.
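- the two filtering passes of S503 can be sketched as follows. This is a minimal illustration that assumes the second texts and second global feature vectors are already stored for each second image; the `MapImage` record and the `retrieve` function are hypothetical names, and sorting the results by distance is a choice of this sketch.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MapImage:
    """One second image as stored in the visual positioning map database."""
    image_id: str
    texts: List[str]                 # second text(s) found in the image
    global_feature: Sequence[float]  # second global feature vector


def retrieve(first_text: str,
             first_feature: Sequence[float],
             map_images: List[MapImage],
             second_distance_threshold: float) -> List[str]:
    """Return ids of third images, closest first."""
    # First filtering: keep second images whose second text includes the
    # first text (these are the fourth images).
    fourth_images = [m for m in map_images if first_text in m.texts]

    # Second filtering: keep fourth images whose Euclidean distance to the
    # first global feature vector is below the second distance threshold.
    scored = []
    for m in fourth_images:
        distance = math.dist(first_feature, m.global_feature)
        if distance < second_distance_threshold:
            scored.append((distance, m.image_id))
    scored.sort()
    return [image_id for _, image_id in scored]


if __name__ == "__main__":
    database = [
        MapImage("img_a", ["0372"], [0.0, 0.0]),
        MapImage("img_b", ["0372"], [3.0, 4.0]),
        MapImage("img_c", ["0411"], [0.0, 0.0]),
    ]
    print(retrieve("0372", [0.0, 0.0], database, second_distance_threshold=1.0))
```

Running the cheap text filter first means the distance computation only touches the images that already share the parking space number or door number text.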
- when the visual positioning map database does not include the second text of the second image and the second global feature vector of the second image
- the second text in each second image can be extracted, and the second global feature vector of each second image can be extracted.
- S504 Determine the target pose based on the first image and the third image.
- the target pose refers to the pose of the device when the first image is collected.
- feature matching and pose estimation can be performed based on the first image and the third image to determine the pose of the device when the first image is collected.
- the feature matching process can be as follows:
- Extract 2D feature points in the first image: for example, feature point detection can be performed on the first image to extract the 2D feature points in the first image.
- a vector that is, a descriptor, used to describe the texture color information and other information of the 2D feature point can be generated.
- the descriptors of the same feature point on different images are close to each other, while the descriptors of different feature points are far apart.
- Match feature points between the first image and the third image: for example, the descriptor distance between the descriptor of each 2D feature point in the first image and the descriptor of each 2D feature point in the third image can be calculated; then, the 2D feature points in the third image whose descriptor distance to a 2D feature point in the first image is smaller than the third distance threshold are used as candidate feature points.
- Inlier screening: geometric constraints can be applied to the candidate feature points to filter out incorrectly matched candidate feature points and obtain the target feature points.
- the process of pose estimation is as follows: the 3D point cloud corresponding to the target feature point can be determined; the pose is then solved according to the position of the target feature point on the third image and the 3D coordinates in the 3D point cloud, to obtain the target pose.
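- the descriptor-distance step of feature matching can be sketched as below. This is an illustration only: it assumes descriptors are plain lists of floats compared by Euclidean distance, and it keeps just the nearest neighbour of each first-image feature, which is a simplification of this sketch; `match_features` is a hypothetical name. Inlier screening and pose solving are not covered.

```python
import math
from typing import List, Sequence, Tuple


def match_features(first_descriptors: List[Sequence[float]],
                   third_descriptors: List[Sequence[float]],
                   third_distance_threshold: float) -> List[Tuple[int, int]]:
    """Return candidate matches as (first_index, third_index) pairs.

    For each 2D feature point of the first image, the closest descriptor in
    the third image is found; the pair is kept only if the descriptor
    distance is smaller than the third distance threshold.
    """
    matches = []
    for i, d_first in enumerate(first_descriptors):
        best_j, best_distance = None, float("inf")
        for j, d_third in enumerate(third_descriptors):
            distance = math.dist(d_first, d_third)
            if distance < best_distance:
                best_j, best_distance = j, distance
        if best_j is not None and best_distance < third_distance_threshold:
            matches.append((i, best_j))
    return matches


if __name__ == "__main__":
    first = [[0.0, 0.0], [10.0, 10.0]]
    third = [[0.1, 0.0], [5.0, 5.0]]
    print(match_features(first, third, third_distance_threshold=0.5))
```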
- this application uses multiple modalities of information such as images and text with high-level semantics in images to perform visual positioning, which can effectively improve the success rate of visual positioning in these scenarios.
- the visual positioning method described in the embodiment of Figure 5 can be applied to the embodiment of Figure 4; in this way, by improving the success rate of visual positioning in scenes such as lighting change scenes/seasonal change scenes/perspective scale change scenes/repeated texture scenes/weak texture scenes, the accuracy of AR visual navigation and the user navigation experience can be improved.
- S409 in Figure 4 is executed in the SLAM (Simultaneous Localization and Mapping) coordinate system (it can be executed by the SLAM system), while the target pose is calculated in the visual positioning coordinate system; the target pose can therefore be converted from the visual positioning coordinate system to the SLAM coordinate system, so that the SLAM system can subsequently perform AR visual navigation based on the first image, the 2D navigation path and the target pose.
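- converting a pose between the two coordinate systems amounts to composing it with the transform that relates them. The sketch below makes two simplifying assumptions that are not taken from this application: poses are planar, written as (x, y, yaw in radians), and the transform from the visual positioning coordinate system to the SLAM coordinate system (the hypothetical `slam_from_vps`) is already known.

```python
import math
from typing import Tuple

Pose2D = Tuple[float, float, float]  # (x, y, yaw in radians)


def wrap_angle(angle: float) -> float:
    """Wrap an angle into the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def compose(a: Pose2D, b: Pose2D) -> Pose2D:
    """Express pose b, given in the frame of pose a, in a's parent frame."""
    ax, ay, a_yaw = a
    bx, by, b_yaw = b
    x = ax + math.cos(a_yaw) * bx - math.sin(a_yaw) * by
    y = ay + math.sin(a_yaw) * bx + math.cos(a_yaw) * by
    return (x, y, wrap_angle(a_yaw + b_yaw))


def to_slam_frame(slam_from_vps: Pose2D, target_pose_vps: Pose2D) -> Pose2D:
    """Convert a target pose from the visual positioning frame to the SLAM frame."""
    return compose(slam_from_vps, target_pose_vps)


if __name__ == "__main__":
    slam_from_vps = (1.0, 2.0, math.pi / 2)
    print(to_slam_frame(slam_from_vps, (1.0, 0.0, 0.0)))
```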
- the visual positioning method in Figure 5 can also be applied to other scenarios, such as holographic information display scenarios, photo taking scenarios, intelligent explanation scenarios, intelligent IoT (Internet of Things) scenarios, and AR game interaction scenarios; this application does not limit this.
- multiple frames of first images can be collected, and then multiple frames can be combined for visual positioning to further improve the success rate of visual positioning.
- Figure 6 is a schematic diagram illustrating an exemplary visual positioning process.
- when the user rotates the device, the device can collect multiple frames of the first image; visual positioning is then performed based on the multiple frames of the first image collected during the rotation.
- K1 first images are collected.
- K1 is an integer greater than 1.
- S602 Extract the first text in the K1 first images and extract the first global feature vector of the K1 first images.
- the first text of each first image among the K1 first images can be extracted, and the first global feature vector of each first image can be extracted.
- image retrieval is performed based on the global feature vectors to select, from a plurality of second images, M third images that respectively match the K1 first images.
- a third image matching each first image may be selected from a plurality of second images.
- the number of third images matching each first image is M (M is a positive integer).
- the number of third images matching each first image may be the same or different, and this application does not limit this.
- N is a positive integer.
- each of the K1 first images can be determined as the target image in turn; then N candidate poses corresponding to the target image can be determined based on the target image and the M third images that match the target image.
- the M third images matching the target image can first be subjected to co-view clustering to obtain N groups of third images; then, according to the method of S504, one candidate pose corresponding to the target image is determined based on the target image and one group of third images. In this way, N candidate poses corresponding to the target image can be determined.
- common-view clustering may refer to clustering with the goal of ensuring that any two sets of third images contain some common 3D point clouds and the number of 3D point clouds is greater than or equal to 1.
- the poses are clustered to obtain N candidate poses corresponding to the target image.
- the single frame confidence levels corresponding to the N candidate poses may be determined.
- the N candidate poses corresponding to each of the K first images and the single frame confidence levels corresponding to the N candidate poses can be determined.
- S605 Determine the target pose based on the N candidate poses corresponding to the K first images and the single frame confidence levels corresponding to the N candidate poses.
- N candidate poses corresponding to K first images can be traversed, and one candidate pose is selected from the N candidate poses corresponding to any two first images to form a pose combination, so as to Get multiple pose combinations.
- the j1th candidate pose can be selected from the N candidate poses corresponding to the i1th first image
- the j2th candidate pose can be selected from the N candidate poses corresponding to the i2th first image; Then the j1th candidate pose and the j2th candidate pose are used to form a pose combination.
- i1 and i2 are positive integers between 1 and K, i1 is not equal to i2; j1 and j2 are positive integers between 1 and M, and j1 is not equal to j2.
- multiple pose combinations can be obtained.
- the relative pose between the two candidate poses included in the first pose combination can be calculated, that is, the relative pose predicted between the two images.
- the SLAM system can output the relative pose between any two of the K first images, which is the true relative pose between the two images, that is, the SLAM pose. Then, the SLAM pose between the two first images corresponding to the first pose combination can be obtained; then, the pose error between the relative pose of the two candidate poses in the first pose combination and the SLAM pose between the corresponding two first images can be calculated, that is, the error between the predicted relative pose and the true relative pose.
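- the pose error between a pose combination and the SLAM pose can be sketched as follows. The sketch assumes planar poses (x, y, yaw in radians) and reports the translation and rotation parts of the error separately; both choices, and the names `relative_pose` and `pose_error`, are assumptions made for illustration.

```python
import math
from typing import Tuple

Pose2D = Tuple[float, float, float]  # (x, y, yaw in radians)


def wrap_angle(angle: float) -> float:
    """Wrap an angle into the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def relative_pose(a: Pose2D, b: Pose2D) -> Pose2D:
    """Pose of b expressed in the frame of a (the predicted relative pose)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    cos_a, sin_a = math.cos(a[2]), math.sin(a[2])
    return (cos_a * dx + sin_a * dy,
            -sin_a * dx + cos_a * dy,
            wrap_angle(b[2] - a[2]))


def pose_error(candidate_a: Pose2D,
               candidate_b: Pose2D,
               slam_relative: Pose2D) -> Tuple[float, float]:
    """Return (translation error, rotation error) against the SLAM pose."""
    predicted = relative_pose(candidate_a, candidate_b)
    translation_error = math.hypot(predicted[0] - slam_relative[0],
                                   predicted[1] - slam_relative[1])
    rotation_error = abs(wrap_angle(predicted[2] - slam_relative[2]))
    return translation_error, rotation_error


if __name__ == "__main__":
    a = (0.0, 0.0, math.pi / 2)
    b = (0.0, 1.0, math.pi / 2)
    print(pose_error(a, b, slam_relative=(1.0, 0.0, 0.0)))
```

A pose combination whose error stays within the preset error is consistent with the motion reported by the SLAM system.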
- the preset error can be set according to requirements, and this application does not limit this.
- the joint confidence weight of the candidate pose combination can be determined first, and then the joint confidence corresponding to the candidate pose combination can be determined based on the single-frame confidences of the two candidate poses included in the candidate pose combination and the joint confidence weight. For example, the single-frame confidences of the two candidate poses included in the candidate pose combination are multiplied by the joint confidence weight to obtain the joint confidence corresponding to the candidate pose combination.
- the joint confidence weight can be determined based on the single-frame confidence of the two candidate poses included in the candidate pose combination and the normal distribution function. For example, the relative pose between the single-frame confidences of the two candidate poses included in the candidate pose combination can be calculated, and the relative pose is used as the independent variable of the normal distribution function to obtain the corresponding probability distribution, that is, the joint confidence weight.
- the mean and variance of the normal distribution function can be set according to requirements, for example, the mean is 0 and the variance is 1. This application does not limit this.
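- the joint confidence computation can be sketched as below. The product of the two single-frame confidences and the joint confidence weight follows the example above; the scalar `x` passed to the normal distribution function is a placeholder for whatever quantity is used as the independent variable, and the function names are hypothetical.

```python
import math


def normal_pdf(x: float, mean: float = 0.0, variance: float = 1.0) -> float:
    """Density of a normal distribution with the given mean and variance."""
    return (math.exp(-((x - mean) ** 2) / (2.0 * variance))
            / math.sqrt(2.0 * math.pi * variance))


def joint_confidence(confidence_a: float,
                     confidence_b: float,
                     x: float,
                     mean: float = 0.0,
                     variance: float = 1.0) -> float:
    """Multiply both single-frame confidences by the joint confidence weight."""
    joint_weight = normal_pdf(x, mean, variance)
    return confidence_a * confidence_b * joint_weight


if __name__ == "__main__":
    # The weight is largest when x equals the mean and decays as x moves away.
    print(joint_confidence(0.8, 0.5, x=0.0))
    print(joint_confidence(0.8, 0.5, x=2.0))
```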
- the large-angle visual positioning of multiple frames can be combined, thereby improving the success rate of visual positioning.
- it can improve the visual positioning success rate in scenes such as lighting changing scenes, seasonal changing scenes, viewing angle scale changing scenes, repeated texture scenes, and weak texture scenes.
- the pose with the highest single-frame confidence among the N candidate poses corresponding to the K first images can be determined as the target pose.
- Figure 7 is a schematic diagram illustrating an exemplary visual positioning process.
- visual positioning is implemented by combining the first image used for this visual positioning and the first image used for the last visual positioning.
- the visual positioning process does not require user cooperation, which can not only improve the success rate of visual positioning, but also improve the user experience and have higher practicability.
- S702 Extract the first text in the first image and extract the first global feature vector of the first image.
- S703 Perform image retrieval based on the first text, the first global feature vector, the preset second texts in the plurality of second images and the second global feature vectors of the plurality of second images, to select from the plurality of second images a third image that matches the first image; the plurality of second images are collected in the process of constructing the visual positioning map.
- S704 Determine the R first candidate poses with the highest confidence in the single frame corresponding to the first image used for this visual positioning, where R is a positive integer.
- S704 may refer to the method of S604 to determine multiple candidate poses corresponding to the first image used for this visual positioning, and then select, from these candidate poses, the R first candidate poses with the highest single-frame confidence.
- on the one hand, S705 to S707 can be executed to determine the target pose; on the other hand, these R first candidate poses can be saved for use in subsequent visual positioning.
- S705 Add SLAM poses to the R second candidate poses with the highest single-frame confidence corresponding to the first image used in the last visual positioning, respectively, to obtain R third candidate poses.
- the R second candidate poses with the highest single-frame confidence corresponding to the first image used in the last visual positioning were stored; therefore, these R second candidate poses can be obtained.
- S706 Determine the probabilities corresponding to the R third candidate poses and the probabilities corresponding to the R first candidate poses.
- the three second candidate poses are a1, a2 and a3 respectively
- the single-frame confidence levels of the three second candidate poses are $P_{a1}$, $P_{a2}$ and $P_{a3}$ respectively.
- the three first candidate poses are b1, b2 and b3 respectively
- the single-frame confidence levels of the three first candidate poses are $P_{b1}$, $P_{b2}$ and $P_{b3}$ respectively.
- the success probability of multi-frame positioning is $P_{ms}$
- the failure probability of multi-frame positioning is $P_{mf}$
- the success probability of single-frame positioning is $P_{ss}$
- the failure probability of single-frame positioning is $P_{sf}$.
- the probability that the relative pose between the first candidate pose and the second candidate pose is the same as the SLAM pose is $P_{\Delta rt}$.
- the probability of the third candidate pose c1 is: $P_{a1} \cdot P_{sf} \cdot P_{ms}$
- the probability of the third candidate pose c2 is: $P_{a2} \cdot P_{sf} \cdot P_{ms}$
- the probability of the third candidate pose c3 is: $P_{a3} \cdot P_{sf} \cdot P_{ms}$.
- the probability of the first candidate pose b1 is: $P_{b1} \cdot P_{a} \cdot P_{ss} \cdot P_{\Delta rt} \cdot P_{ms}$ or $P_{b1} \cdot P_{ss} \cdot P_{mf}$.
- the probability of the first candidate pose b2 is: $P_{b2} \cdot P_{a} \cdot P_{ss} \cdot P_{\Delta rt} \cdot P_{ms}$ or $P_{b2} \cdot P_{ss} \cdot P_{mf}$
- the probability of the first candidate pose b3 is: $P_{b3} \cdot P_{a} \cdot P_{ss} \cdot P_{\Delta rt} \cdot P_{ms}$ or $P_{b3} \cdot P_{ss} \cdot P_{mf}$.
- S707 Use the first candidate pose or the third candidate pose with the highest probability as the target pose corresponding to this visual positioning.
- for example, from the R first candidate poses and the R third candidate poses, the candidate pose with the highest probability (which may be a first candidate pose or a third candidate pose) is selected as the target pose corresponding to this visual positioning.
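- the selection in S706 and S707 can be sketched as follows. The sketch scores each third candidate pose as its single-frame confidence times the single-frame failure probability times the multi-frame success probability, and scores each first candidate pose with only the simpler of its two expressions (single-frame confidence times single-frame success probability times multi-frame failure probability); leaving out the other expression is a simplification of this sketch, and all numeric values are made up for illustration.

```python
from typing import Dict, Tuple


def select_target_pose(third_confidences: Dict[str, float],
                       first_confidences: Dict[str, float],
                       p_ss: float, p_sf: float,
                       p_ms: float, p_mf: float) -> Tuple[float, str, str]:
    """Return (probability, kind, name) of the most probable candidate pose.

    third_confidences maps each third candidate pose to the single-frame
    confidence of the second candidate pose it was derived from.
    """
    scored = []
    for name, confidence in third_confidences.items():
        scored.append((confidence * p_sf * p_ms, "third", name))
    for name, confidence in first_confidences.items():
        scored.append((confidence * p_ss * p_mf, "first", name))
    return max(scored)


if __name__ == "__main__":
    best = select_target_pose(
        third_confidences={"c1": 0.9, "c2": 0.5, "c3": 0.2},
        first_confidences={"b1": 0.6, "b2": 0.3, "b3": 0.1},
        p_ss=0.7, p_sf=0.3, p_ms=0.8, p_mf=0.2,
    )
    print(best)
```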
- the error of single-frame visual positioning is reduced and the success rate of visual positioning is improved.
- it can improve the success rate of visual positioning in scenes such as lighting changing scenes/seasonal changing scenes/perspective scale changing scenes/repeated texture scenes/weak texture scenes.
- This application also provides a navigation map construction method, which registers the location of a place into the visual positioning map and the 2D visual navigation map generated in the embodiment of Figure 5; in this way, when the user navigates with the AR map, the AR map can provide AR visual navigation, that is, execute S203 in the above-mentioned embodiment of FIG. 2, or S406-S409 in the above-mentioned embodiment of FIG. 4.
- Figure 8 is a schematic diagram illustrating an exemplary navigation map construction process.
- the place identification information includes place identification text and/or place identification graphics.
- the location identification information of the location can be collected.
- the location identification information may be location identification text or location identification graphics.
- the place identification text can be the name of the supermarket
- the place identification graphic can be the supermarket trademark.
- the place logo text can be the name of the zoo
- the place logo graphic can be the zoo trademark
- S802 Perform multi-modal retrieval among a plurality of preset first images to determine a second image containing place identification information.
- the plurality of first images are collected in the process of constructing a visual positioning map.
- a visual positioning map database is generated; the visual positioning map database may include the images collected by the data collection device (hereinafter referred to as first images), the 2D feature points contained in each first image, the 3D point cloud corresponding to the 2D feature points (which may include 3D coordinates), and data such as the identification text of each first image.
- the place identification information is place identification text (such as place name text)
- the first image containing the place name text can be selected as the second image according to the identification text of the first image in the visual positioning map database.
- the place identification information is a place identification graphic (such as a place trademark graphic)
- the graphics contained in the multiple first images in the visual positioning map database can be extracted respectively; then, based on the graphics contained in each first image, the first image containing the place identification graphic is selected from the multiple first images as the second image.
- a first image containing the name/trademark of the supermarket can be selected from multiple first images as the second image.
- a first image containing the name/trademark of the zoo can be selected from multiple first images as the second image.
- a first image containing the name/trademark of the botanical garden may be selected from a plurality of first images as the second image.
- the second image corresponding to the place may include one or more images.
- S803 Determine the 3D coordinates of the place in the visual positioning map based on the place identification information in the second image.
- the 2D feature points corresponding to the place identification information in the second image are determined; then, from the visual positioning map database, the 3D point cloud (including 3D coordinates) corresponding to the place identification information in the second image is determined.
- the 3D coordinates in the 3D point cloud corresponding to the place identification information in the second image can then be determined as the 3D coordinates of the place in the visual positioning map.
- the 3D point clouds corresponding to the place identification information in the multiple second images can be obtained; the average value of the 3D coordinates in the 3D point clouds corresponding to the place identification information in each second image is used as the 3D coordinates of the place in the visual positioning map.
- S804 Map the 3D coordinates of the place in the visual positioning map to the 2D visual navigation map to obtain the 2D coordinates of the place in the 2D visual navigation map.
- the 3D coordinates of the place in the visual positioning map can be directly mapped to the 2D plane, that is, 3D to 2D conversion can be performed, and then the 2D coordinates of the place in the 2D visual navigation map can be obtained.
- the 3D coordinates in the 3D point cloud corresponding to the place identification information in each second image can be mapped to the 2D plane, to obtain multiple sets of 2D coordinates. Then, the average value of the multiple sets of 2D coordinates can be calculated to obtain the 2D coordinates of the place in the 2D visual navigation map.
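- registering a place in the 2D visual navigation map can be sketched as below. The sketch assumes that mapping to the 2D plane means dropping the vertical axis (a top-down orthographic projection, with the vertical axis at index 2); this projection and the function names are assumptions for illustration, not details given in this application.

```python
from typing import List, Sequence, Tuple


def mean_point(points: Sequence[Sequence[float]]) -> Tuple[float, ...]:
    """Component-wise average of a non-empty list of points."""
    count = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / count for i in range(dims))


def project_to_2d(point_3d: Sequence[float], up_axis: int = 2) -> Tuple[float, ...]:
    """Drop the vertical axis to map a 3D coordinate onto the 2D plane."""
    return tuple(v for i, v in enumerate(point_3d) if i != up_axis)


def place_coordinates_2d(clouds_per_image: List[List[Sequence[float]]],
                         up_axis: int = 2) -> Tuple[float, ...]:
    """Average the 3D points of the place per second image, project each
    average to 2D, then average the resulting 2D coordinates."""
    per_image_2d = [project_to_2d(mean_point(cloud), up_axis)
                    for cloud in clouds_per_image]
    return mean_point(per_image_2d)


if __name__ == "__main__":
    clouds = [
        [(0.0, 0.0, 5.0), (2.0, 2.0, 7.0)],  # points seen in one second image
        [(3.0, 1.0, 0.0)],                   # points seen in another second image
    ]
    print(place_coordinates_2d(clouds))
```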
- a navigation map within the place can be constructed to subsequently guide users to quickly reach the locations of various objects in the larger place.
- Figure 9 is a schematic diagram illustrating an exemplary navigation map construction process. In the embodiment of Figure 9, a process of constructing a visual positioning map within a venue is shown.
- S901 Collect map reconstruction data in the place, and perform three-dimensional reconstruction based on the map reconstruction data to update the visual positioning map.
- the map reconstruction data includes a third image in the place, and the place contains multiple categories of objects.
- The visual positioning map within the place is constructed in much the same way as the visual positioning map described in the embodiment of FIG. 5 above. The difference is that the data collection device is walked through the inside of the place to collect images and other data there; three-dimensional reconstruction is then performed on the collected data to obtain the visual positioning map within the place.
- the visual positioning map in the place can be added to the visual positioning map in the embodiment of FIG. 5 to update the visual positioning map in the embodiment of FIG. 5 to obtain an updated visual positioning map.
- the visual positioning map in the place can be regarded as a sub-map of the visual positioning map in the embodiment of FIG. 5 mentioned above.
- the user can visually locate at the entrance of the venue and then start the SLAM system in the AR map.
- The user can hold a mobile phone or wear a device such as AR glasses (with the camera on) and walk through the place, taking as the target a traversal of the category identification information of all categories in the place (which can include category identification text and/or category identification graphics), to collect the third images in the place.
- OCR can be performed on the third image to identify the category identification text in the third image.
- The category identification text recognized from the third image may be, for example, "toiletries", "snacks", or "fresh fruits and vegetables".
- The category identification text recognized from the third image may also be, for example, "Seal House", "Penguin House", or "Giraffe Feeding Area".
- 2D feature points corresponding to the category identification text in the third image can be extracted.
- S903 Determine the 3D point cloud of the 2D feature points corresponding to the category identification text in the SLAM coordinate system.
- The SLAM system can compute the 3D point cloud, in the SLAM coordinate system, of the 2D feature points corresponding to the category identification text.
- S904 Map the 3D point cloud in the SLAM coordinate system to a 3D point cloud in the visual positioning map coordinate system corresponding to the updated visual positioning map.
- the SLAM coordinate system and the visual positioning map coordinate system are different coordinate systems.
- The 3D point cloud in the SLAM coordinate system can be mapped to a 3D point cloud in the visual positioning map coordinate system corresponding to the updated visual positioning map.
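The mapping in S904 between the two coordinate systems can be illustrated with a rigid transform. This is a sketch under stated assumptions: the text does not say how the transform is obtained, so the 4x4 matrix below (a rotation about the vertical axis plus a translation) and all names and values are hypothetical.

```python
import math

def make_transform(yaw_rad, translation):
    """Build a 4x4 rigid transform: rotation about the z axis, then a translation."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]

def transform_points(T, points):
    """Apply the rigid transform T to every (x, y, z) point."""
    return [
        tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))
        for x, y, z in points
    ]

# Assumed transform from the SLAM frame to the visual positioning map frame.
T_map_from_slam = make_transform(math.pi / 2, (5.0, 0.0, 0.0))

slam_points = [(1.0, 0.0, 0.0)]            # point cloud in the SLAM coordinate system
map_points = transform_points(T_map_from_slam, slam_points)
```

In practice such a transform could plausibly be derived from a pose that is known in both coordinate systems, for example the visual positioning result obtained at the entrance before the SLAM system is started; that link is an inference, not something the text states.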
- the following describes the process of constructing a 2D visual navigation map corresponding to each category of objects in the site.
- Figure 10 is a schematic diagram illustrating an exemplary navigation map construction process.
- the category identification information includes category identification text and/or category identification graphics.
- The 3D point cloud of the within-place visual positioning map, which is part of the updated visual positioning map, can be mapped to a 2D plane to obtain a 2D visual navigation map within the place. This map is then added to the 2D visual navigation map of the embodiment of Figure 5, which covers multiple places, to obtain an updated 2D visual navigation map.
- the 2D visual navigation map in the place can be regarded as a sub-map of the 2D visual navigation map in the embodiment of FIG. 5 mentioned above.
- S1005 Map the 3D coordinates of the object of the category in the updated visual positioning map to the 2D visual navigation map to obtain the 2D coordinates of the object of the category in the 2D visual navigation map.
- FIG. 11 shows a schematic block diagram of a device 1100 according to an embodiment of the present application.
- the device 1100 may include: a processor 1101 and a transceiver/transceiver pin 1102, and optionally, a memory 1103.
- In addition to a data bus, bus 1104 includes a power bus, a control bus, and a status signal bus; all of these buses are referred to as bus 1104 in the figure.
- the memory 1103 may be used to store instructions in the foregoing method embodiments.
- the processor 1101 can be used to execute instructions in the memory 1103, and control the receiving pin to receive signals, and control the transmitting pin to send signals.
- the device 1100 may be the electronic device or a chip of the electronic device in the above method embodiment.
- This embodiment also provides a computer-readable storage medium.
- Computer instructions are stored in the computer-readable storage medium.
- When the computer instructions are run on an electronic device, they cause the electronic device to execute the related method steps above, implementing the navigation method and/or visual positioning method and/or navigation map construction method of the above embodiments.
- This embodiment also provides a computer program product.
- When the computer program product is run on a computer, it causes the computer to perform the related steps above, implementing the navigation method and/or visual positioning method and/or navigation map construction method of the above embodiments.
- Embodiments of the present application also provide a device.
- This device may be a chip, a component or a module.
- the device may include a connected processor and a memory.
- the memory is used to store computer execution instructions.
- The processor can execute the computer-executable instructions stored in the memory, so that the chip performs the navigation method and/or the visual positioning method and/or the navigation map construction method of the above method embodiments.
- The electronic devices, computer-readable storage media, computer program products, and chips provided in this embodiment are all used to execute the corresponding methods provided above. For the beneficial effects they can achieve, refer to the beneficial effects of the corresponding methods provided above; they are not described again here.
- the disclosed devices and methods can be implemented in other ways.
- the device embodiments described above are only illustrative.
- the division of modules or units is only a logical function division.
- In actual implementation there may be other ways of dividing them; for example, multiple units or components may be combined or integrated into another device, or some features may be ignored or not implemented.
- the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
- a unit described as a separate component may or may not be physically separate.
- a component shown as a unit may be one physical unit or multiple physical units, that is, it may be located in one place, or it may be distributed to multiple different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
- the above integrated units can be implemented in the form of hardware or software functional units.
- Integrated units may be stored in a readable storage medium if they are implemented in the form of software functional units and sold or used as independent products.
- The technical solutions of the embodiments of the present application, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that cause a device (which can be a microcontroller, a chip, etc.) or a processor to execute all or part of the steps of the methods of the various embodiments of the present application.
- the aforementioned storage media include: U disk, mobile hard disk, read only memory (ROM), random access memory (RAM), magnetic disk or optical disk and other media that can store program code.
- the steps of the methods or algorithms described in connection with the disclosure of the embodiments of this application can be implemented in hardware or by a processor executing software instructions.
- Software instructions can be composed of corresponding software modules, and software modules can be stored in random access memory (Random Access Memory, RAM), flash memory, read-only Memory (Read Only Memory, ROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable Read Only Memory (Electrically EPROM, EEPROM), register, hard disk, mobile hard disk, read-only disc (CD-ROM) or any other form of storage media well known in the art.
- An exemplary storage medium is coupled to the processor such that the processor can read information from the storage medium and write information to the storage medium.
- the storage medium can also be an integral part of the processor.
- the processor and storage media may be located in an ASIC.
- Computer-readable media includes computer-readable storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
- Storage media can be any available media that can be accessed by a general purpose or special purpose computer.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Remote Sensing (AREA)
- General Engineering & Computer Science (AREA)
- Radar, Positioning & Navigation (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computer Graphics (AREA)
- Computer Hardware Design (AREA)
- Automation & Control Theory (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Multimedia (AREA)
- Processing Or Creating Images (AREA)
Abstract
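Embodiments of this application provide a navigation method, a visual positioning method, a navigation map construction method, and an electronic device, applied in the field of data processing. The navigation method includes: first, obtaining location search information entered by a user in an augmented reality (AR) map, the location search information including at least one of text, speech, or an image; next, performing a multimodal search over preset multimodal information based on the location search information, to determine a location search result that matches the location search information; and then performing AR visual navigation according to the location search result. Compared with existing techniques that accept only text input for navigation, this application lets users enter location search information in several modalities, which diversifies the input options for navigation search and can improve the user's navigation experience.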
Description
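This application claims priority to Chinese patent application No. 202210709970.3, filed with the China National Intellectual Property Administration on June 22, 2022 and entitled "Navigation, visual positioning, and navigation map construction methods and electronic device", which is incorporated herein by reference in its entirety.
Embodiments of this application relate to the field of data processing, and in particular to navigation, visual positioning, and navigation map construction methods and an electronic device.
An AR (Augmented Reality) map is one of the commercial applications of spatial computing technology. It can provide holographic information display (letting users see signage that blends virtual and real information, together with detailed descriptions), intelligent search (letting users easily find popular photo spots, the nearest restroom, and so on), AR visual navigation (letting users see navigation rendered with AR effects, intuitively and in real time), and AR interaction (letting users take photos with virtual characters, take part in virtual activities in the AR world, and so on).
At present, intelligent search in an AR map accepts only text input, such as the name of an attraction (e.g. ** Park), a shop (e.g. ** Gold), or a building (e.g. ** Tower). The search has a single input mode, and the user experience is poor.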
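Summary of the Invention
To solve the above technical problem, this application provides navigation, visual positioning, and navigation map construction methods and an electronic device. The navigation method accepts multimodal location search information and provides AR visual navigation based on the multimodal location search information that the user enters in the AR map. This diversifies the input options for navigation search and improves the user's navigation experience.
In a first aspect, an embodiment of this application provides a navigation method. The method includes: first, obtaining location search information entered by a user in an augmented reality (AR) map, the location search information including at least one of text, speech, or an image; next, performing a multimodal search over preset multimodal information based on the location search information, to determine a location search result that matches the location search information; and then performing AR visual navigation according to the location search result. Compared with existing techniques that accept only text input for navigation, this application lets users enter location search information in several modalities, which diversifies the input options for navigation search and can improve the user's navigation experience.
Illustratively, the location search information may also be text and speech, text and an image, speech and an image, or text, speech, and an image; this application places no limitation on this.
Illustratively, the AR map may be an application or a mini program on the device; this application places no limitation on this.
Illustratively, there may be one location search result or several. When there are several, AR visual navigation may follow the result the user selects, or one result may be chosen according to a preset rule (such as nearest to the user or best rated); this application places no limitation on this.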
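According to the first aspect, the multimodal information includes entity identification information for a plurality of entities, and the entity identification information includes entity identification text and/or an entity identification image. Performing a multimodal search over the preset multimodal information based on the location search information, to determine the matching location search result, includes: extracting features from the location search information to obtain a first feature corresponding to the location search information; extracting features from the plurality of pieces of entity identification information to obtain a second feature for each piece; determining a first feature distance between each second feature and the first feature; and selecting, from the plurality of pieces of entity identification information, those whose first feature distance is less than a first distance threshold as the location search result.
This enables cross-modal search. When the location search information is text or an image, the search runs over both the entity identification text and the entity identification images in the preset multimodal information, to find the entity identification text and/or images that match the location search information as the location search result. When the location search information is speech, speech recognition is performed first to obtain recognized text; the search then runs over the entity identification text and entity identification images in the preset multimodal information, to find the matching entity identification text and/or images as the location search result. Images and text can thus be searched against each other. This allows "fuzzy" search that covers the entity identification information as completely as possible and avoids omissions.
Illustratively, the multimodal search may also include same-modal search, which may proceed as follows. When the location search information is text, the search runs over the entity identification text in the preset multimodal information, to find the entity identification text that matches the location search information as the location search result. When the location search information is speech, speech recognition is performed first to obtain recognized text; the search then runs over the entity identification text in the preset multimodal information, to find the entity identification text that matches the recognized text as the location search result. When the location search information is an image, the search runs over the entity identification images in the preset multimodal information, to find the entity identification image that matches the location search information as the location search result.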
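According to the first aspect, or any implementation of the first aspect above, the location search information includes at least one of: entity name text, entity name speech, or an entity image; an entity includes a place and/or an object contained in a place. In this way, the AR map can provide AR visual navigation not only when the user enters place name text, place name speech, or a place image, but also when the user enters an object image, object name text, or object name speech. The user can thus be guided quickly to the location of the desired entity.
For example, the place is a supermarket and the objects it contains are the goods it sells. Whether the user enters the name or image of the supermarket, or the name or image of a product sold there, the AR map can quickly guide the user to the desired supermarket.
For example, the place is a zoo and the objects it contains are the animals in the zoo. Whether the user enters the name or image of the zoo, or the name or image of an animal in it, the AR map can quickly guide the user to the desired zoo.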
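According to the first aspect, or any implementation of the first aspect above, determining the first feature distance between each second feature and the first feature includes, for first entity identification information among the plurality of pieces of entity identification information, where the first entity identification information includes both entity identification text and an entity identification image: computing a weighted combination of (a) the first feature distance between the first feature and the second feature of the entity identification text, and (b) the first feature distance between the first feature and the second feature of the entity identification image; and taking the result of the weighted combination as the first feature distance between the first feature and the second feature corresponding to the first entity identification information.
According to the first aspect, or any implementation of the first aspect above, performing AR visual navigation according to the location search result includes: generating, based on a pre-generated 2D visual navigation map, a 2D navigation path between the user's current position and the target position corresponding to the location search result; performing visual positioning to determine a target pose, the target pose being the current pose of the device; and performing AR visual navigation based on the 2D navigation path and the target pose.
Illustratively, the device that captures the first image may be a mobile phone, an AR device, or any other device that can capture images; this application places no limitation on this.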
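According to the first aspect, or any implementation of the first aspect above, performing visual positioning to determine the target pose includes: capturing a first image; extracting first text from the first image and extracting a first global feature vector of the first image; performing image retrieval based on the first text, the first global feature vector, second text in a preset plurality of second images, and second global feature vectors of the second images, to select from the second images a third image that matches the first image, where the second images were captured while building the visual positioning map and the 2D visual navigation map is generated from the visual positioning map; and determining the target pose based on the first image and the third image, the target pose being the pose of the device when it captured the first image.
Text that carries high-level semantics distinguishes well between different images in scenes with illumination changes, seasonal changes, viewpoint or scale changes, repetitive textures, or weak textures. Compared with existing techniques that perform visual positioning from the image modality alone, this application uses information in multiple modalities, namely the image and the high-level-semantic text in the image, and can therefore effectively raise the visual positioning success rate in these scenes.
According to the first aspect, or any implementation of the first aspect above, performing image retrieval based on the first text, the first global feature vector, the second text in the preset plurality of second images, and the second global feature vectors of the second images, to select from the second images the third image that matches the first image, includes: selecting, according to the second text in the second images, a plurality of fourth images that contain the first text; determining a second feature distance between the second global feature vector of each fourth image and the first global feature vector; and selecting, from the fourth images, the third image whose second feature distance is less than a second distance threshold. This implements image retrieval: the third image that matches the first image is retrieved from the second images (here, a second image that matches the first image may be one whose shooting angle is similar to, and whose shooting distance is close to, that of the first image). The two filtering passes can improve both the accuracy and the speed of image retrieval.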
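According to the first aspect, or any implementation of the first aspect above, extracting the first global feature vector of the first image includes: determining the target region that the first text occupies in the first image; increasing the weights corresponding to the target region in a network layer of a trained feature extraction network; and feeding the first image into the feature extraction network to obtain the first global feature vector that the network outputs. This makes the part of the first global feature vector that corresponds to the first text more accurate, which in turn makes the third image selected by the second filtering pass more accurate.
According to the first aspect, or any implementation of the first aspect above, capturing the first image includes: capturing K first images while the device rotates, each first image having M matching third images, where K is an integer greater than 1 and M is a positive integer. Determining the target pose based on the first image and the third image includes: determining, based on the M third images matched to each of the K first images, N candidate poses for each of the K first images and a single-frame confidence for each of the N candidate poses, where N is a positive integer; and determining the target pose according to the N candidate poses of each of the K first images and their single-frame confidences. Performing visual positioning over multiple frames captured during rotation gives a wide field of view and thus raises the visual positioning success rate, in particular in scenes with illumination changes, seasonal changes, viewpoint or scale changes, repetitive textures, or weak textures.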
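According to the first aspect, or any implementation of the first aspect above, determining the target pose according to the N candidate poses of each of the K first images and their single-frame confidences includes: traversing the N candidate poses of each of the K first images and, for any two first images, selecting one candidate pose from the N candidate poses of each to form a pose combination, thereby obtaining a plurality of pose combinations; for a first pose combination, determining the SLAM (Simultaneous Localization and Mapping) pose between the two first images corresponding to the first pose combination, and the relative pose between the two candidate poses in the first pose combination; determining the pose error between the SLAM pose and the relative pose; if there are candidate pose combinations whose pose error is less than a preset error, determining a joint confidence for each such candidate pose combination from its single-frame confidences; and taking the candidate pose combination with the highest joint confidence as the target pose.
According to the first aspect, or any implementation of the first aspect above, if no candidate pose combination has a pose error less than the preset error, the candidate pose with the highest single-frame confidence among the N candidate poses of each of the K first images is taken as the target pose.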
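According to the first aspect, or any implementation of the first aspect above, determining the target pose based on the first image and the third image includes: determining the R first candidate poses with the highest single-frame confidence for the first image used in the current visual positioning, where R is a positive integer; adding the SLAM pose to each of the R second candidate poses with the highest single-frame confidence for the first image used in the previous visual positioning, to obtain R third candidate poses; determining a probability for each of the R third candidate poses and for each of the R first candidate poses; and taking the first or third candidate pose with the highest probability as the target pose. Combining the candidate poses from the previous visual positioning with the current one reduces single-frame positioning error and raises the visual positioning success rate, in particular in scenes with illumination changes, seasonal changes, viewpoint or scale changes, repetitive textures, or weak textures.
According to the first aspect, or any implementation of the first aspect above, the method further includes: collecting place identification information of a place, the place identification information including place identification text and/or a place identification graphic; performing a multimodal search over the preset plurality of second images to determine a fifth image that contains the place identification information, the second images having been captured while building the visual positioning map; determining the 3D (3-dimensional) coordinates of the place in the visual positioning map according to the place identification information in the fifth image; and mapping the 3D coordinates of the place in the visual positioning map into the 2D (2-dimensional) visual navigation map, to obtain the 2D coordinates of the place in the 2D visual navigation map. The location of the place is thereby registered in the visual positioning map and the 2D visual navigation map, so that AR visual navigation can be provided when the user navigates with the AR map.
According to the first aspect, or any implementation of the first aspect above, the method further includes: collecting map reconstruction data inside the place and performing three-dimensional reconstruction from the map reconstruction data to update the visual positioning map, where the map reconstruction data includes sixth images captured inside the place and the place contains objects of multiple categories; extracting the 2D feature points corresponding to category identification text in the sixth images, and determining the 3D point cloud of those 2D feature points in the SLAM coordinate system; and mapping the 3D point cloud in the SLAM coordinate system to a 3D point cloud in the visual positioning map coordinate system of the updated visual positioning map. A visual positioning map of the inside of the place can thus be built, so that users can later be guided quickly to where each category of object is located within a large place.
Illustratively, a visual positioning map of the inside of a place may be built for large places (such as large supermarkets and large zoos); for small places it may be unnecessary.
According to the first aspect, or any implementation of the first aspect above, the method further includes: collecting category identification information of a category, the category identification information including category identification text and/or a category identification graphic; performing a multimodal search over the sixth images to determine a seventh image that contains the category identification information; determining, according to the category identification information in the seventh image, the 3D coordinates of the objects of the category in the updated visual positioning map; updating the 2D visual navigation map according to the updated visual positioning map; and mapping the 3D coordinates of the objects of the category in the visual positioning map into the updated 2D visual navigation map, to obtain the 2D coordinates of the objects of the category in the updated 2D visual navigation map. The locations of the objects of each category within the place are thereby registered in the visual positioning map and the 2D visual navigation map, so that AR visual navigation can be provided when the user navigates with the AR map, guiding the user quickly to where each category of object is located within a large place.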
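In a second aspect, an embodiment of this application provides a visual positioning method. The method includes: first, capturing a first image; next, extracting first text from the first image and extracting a first global feature vector of the first image; then, performing image retrieval based on the first text, the first global feature vector, second text in a preset plurality of second images, and second global feature vectors of the second images, to select from the second images a third image that matches the first image, the second images having been captured while building the visual positioning map; and finally, determining a target pose based on the first image and the third image, the target pose being the pose of the device when it captured the first image.
Text that carries high-level semantics distinguishes well between different images in scenes with illumination changes, seasonal changes, viewpoint or scale changes, repetitive textures, or weak textures. Compared with existing techniques that perform visual positioning from the image modality alone, this application uses information in multiple modalities, namely the image and the high-level-semantic text in the image, and can therefore effectively raise the visual positioning success rate in these scenes.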
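According to the second aspect, performing image retrieval based on the first text, the first global feature vector, the second text in the preset plurality of second images, and the second global feature vectors of the second images, to select from the second images the third image that matches the first image, includes: selecting, according to the second text in the second images, a plurality of fourth images that contain the first text; determining a second feature distance between the second global feature vector of each fourth image and the first global feature vector; and selecting, from the fourth images, the third image whose second feature distance is less than a second distance threshold. This implements image retrieval: the third image that matches the first image is retrieved from the second images (here, a second image that matches the first image may be one whose shooting angle is similar to, and whose shooting distance is close to, that of the first image). The two filtering passes can improve both the accuracy and the speed of image retrieval.
According to the second aspect, or any implementation of the second aspect above, extracting the first global feature vector of the first image includes: determining the target region that the first text occupies in the first image; increasing the weights corresponding to the target region in a network layer of a trained feature extraction network; and feeding the first image into the feature extraction network to obtain the first global feature vector that the network outputs. This makes the part of the first global feature vector that corresponds to the first text more accurate, which in turn makes the third image selected by the second filtering pass more accurate.
According to the second aspect, or any implementation of the second aspect above, capturing the first image includes: capturing K first images while the device rotates, each first image having M matching third images, where K is an integer greater than 1 and M is a positive integer. Determining the target pose based on the first image and the third image includes: determining, based on the M third images matched to each of the K first images, N candidate poses for each of the K first images and a single-frame confidence for each of the N candidate poses, where N is a positive integer; and determining the target pose according to the N candidate poses of each of the K first images and their single-frame confidences. Performing visual positioning over multiple frames captured during rotation gives a wide field of view and thus raises the visual positioning success rate, in particular in scenes with illumination changes, seasonal changes, viewpoint or scale changes, repetitive textures, or weak textures.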
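According to the second aspect, or any implementation of the second aspect above, determining the target pose according to the N candidate poses of each of the K first images and their single-frame confidences includes: traversing the N candidate poses of each of the K first images and, for any two first images, selecting one candidate pose from the N candidate poses of each to form a pose combination, thereby obtaining a plurality of pose combinations; for a first pose combination, determining the SLAM (simultaneous localization and mapping) pose between the two first images corresponding to the first pose combination, and the relative pose between the two candidate poses in the first pose combination; determining the pose error between the SLAM pose and the relative pose; if there are candidate pose combinations whose pose error is less than a preset error, determining a joint confidence for each such candidate pose combination from its single-frame confidences; and taking the candidate pose combination with the highest joint confidence as the target pose.
According to the second aspect, or any implementation of the second aspect above, the method further includes: if no candidate pose combination has a pose error less than the preset error, taking the candidate pose with the highest single-frame confidence among the N candidate poses of each of the K first images as the target pose.
According to the second aspect, or any implementation of the second aspect above, determining the N candidate poses of each of the K first images based on the M third images matched to each of them includes, for a target image among the K first images: performing covisibility clustering on the M third images matched to the target image, to obtain N groups of third images; and determining the N candidate poses of the target image based on the target image and the N groups of third images.
According to the second aspect, or any implementation of the second aspect above, determining the N candidate poses of each of the K first images based on the M third images matched to each of them includes, for a target image among the K first images: determining M candidate poses of the target image based on the target image and its M matched third images; and clustering the M candidate poses of the target image to obtain the N candidate poses of the target image.
According to the second aspect, or any implementation of the second aspect above, determining the target pose based on the first image and the third image includes: determining the R first candidate poses with the highest single-frame confidence for the first image used in the current visual positioning, where R is a positive integer; adding the SLAM pose to each of the R second candidate poses with the highest single-frame confidence for the first image used in the previous visual positioning, to obtain R third candidate poses; determining a probability for each of the R third candidate poses and for each of the R first candidate poses; and taking the first or third candidate pose with the highest probability as the target pose. Combining the candidate poses from the previous visual positioning with the current one reduces single-frame positioning error and raises the visual positioning success rate, in particular in scenes with illumination changes, seasonal changes, viewpoint or scale changes, repetitive textures, or weak textures.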
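In a third aspect, an embodiment of this application provides a navigation map construction method. The method includes: collecting place identification information of a place, the place identification information including place identification text and/or a place identification graphic; performing multimodal retrieval over a preset plurality of first images to determine a second image that contains the place identification information, the first images having been captured while building the visual positioning map; determining the 3D coordinates of the place in the visual positioning map according to the place identification information in the second image; and mapping the 3D coordinates of the place in the visual positioning map into the 2D visual navigation map, to obtain the 2D coordinates of the place in the 2D visual navigation map, the 2D visual navigation map being generated from the visual positioning map. The location of the place is thereby registered in the visual positioning map and the 2D visual navigation map, so that AR visual navigation can be provided when the user navigates with the AR map.
According to the third aspect, the method further includes: collecting map reconstruction data inside the place and performing three-dimensional reconstruction from the map reconstruction data to update the visual positioning map, where the map reconstruction data includes third images captured inside the place and the place contains objects of multiple categories; extracting the 2D feature points corresponding to category identification text in the third images, and determining the 3D point cloud of those 2D feature points in the SLAM coordinate system; and mapping the 3D point cloud in the SLAM coordinate system to a 3D point cloud in the visual positioning map coordinate system of the updated visual positioning map. A visual positioning map of the inside of the place can thus be built, so that users can later be guided quickly to where each category of object is located within a large place.
According to the third aspect, or any implementation of the third aspect above, the method further includes: obtaining category identification information of a category, the category identification information including category identification text and/or a category identification graphic; performing multimodal retrieval over the third images to determine a fourth image that contains the category identification information; determining, according to the category identification information in the fourth image, the 3D coordinates of the objects of the category in the updated visual positioning map; updating the 2D visual navigation map according to the updated visual positioning map; and mapping the 3D coordinates of the objects of the category in the updated visual positioning map into the 2D visual navigation map, to obtain the 2D coordinates of the objects of the category in the 2D visual navigation map. The locations of the objects of each category within the place are thereby registered in the visual positioning map and the 2D visual navigation map, so that AR visual navigation can be provided when the user navigates with the AR map, guiding the user quickly to where each category of object is located within a large place.
It should be noted that the first image in the third aspect and any implementation of the third aspect is the same image as the second image in the first aspect and any implementation of the first aspect, under a different name. Likewise, the second image in the third aspect and its implementations is the same as the fifth image in the first aspect and its implementations; the third image in the third aspect and its implementations is the same as the sixth image in the first aspect and its implementations; and the fourth image in the third aspect and its implementations is the same as the seventh image in the first aspect and its implementations.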
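In a fourth aspect, an embodiment of this application provides an electronic device including a memory and a processor, the memory being coupled to the processor. The memory stores program instructions which, when executed by the processor, cause the electronic device to perform the navigation method of the first aspect or any possible implementation of the first aspect.
The fourth aspect and any implementation of the fourth aspect correspond to the first aspect and any implementation of the first aspect, respectively. For their technical effects, see the technical effects of the first aspect and its implementations above; they are not repeated here.
In a fifth aspect, an embodiment of this application provides an electronic device including a memory and a processor, the memory being coupled to the processor. The memory stores program instructions which, when executed by the processor, cause the electronic device to perform the visual positioning method of the second aspect or any possible implementation of the second aspect.
The fifth aspect and any implementation of the fifth aspect correspond to the second aspect and any implementation of the second aspect, respectively. For their technical effects, see the technical effects of the second aspect and its implementations above; they are not repeated here.
In a sixth aspect, an embodiment of this application provides an electronic device including a memory and a processor, the memory being coupled to the processor. The memory stores program instructions which, when executed by the processor, cause the electronic device to perform the navigation map construction method of the third aspect or any possible implementation of the third aspect.
The sixth aspect and any implementation of the sixth aspect correspond to the third aspect and any implementation of the third aspect, respectively. For their technical effects, see the technical effects of the third aspect and its implementations above; they are not repeated here.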
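In a seventh aspect, an embodiment of this application provides a chip including one or more interface circuits and one or more processors. The interface circuit is configured to receive signals from the memory of an electronic device and send the signals to the processor, the signals including computer instructions stored in the memory. When the processor executes the computer instructions, the electronic device performs the navigation method of the first aspect or any possible implementation of the first aspect.
The seventh aspect and any implementation of the seventh aspect correspond to the first aspect and any implementation of the first aspect, respectively. For their technical effects, see the technical effects of the first aspect and its implementations above; they are not repeated here.
In an eighth aspect, an embodiment of this application provides a chip including one or more interface circuits and one or more processors. The interface circuit is configured to receive signals from the memory of an electronic device and send the signals to the processor, the signals including computer instructions stored in the memory. When the processor executes the computer instructions, the electronic device performs the visual positioning method of the second aspect or any possible implementation of the second aspect.
The eighth aspect and any implementation of the eighth aspect correspond to the second aspect and any implementation of the second aspect, respectively. For their technical effects, see the technical effects of the second aspect and its implementations above; they are not repeated here.
In a ninth aspect, an embodiment of this application provides a chip including one or more interface circuits and one or more processors. The interface circuit is configured to receive signals from the memory of an electronic device and send the signals to the processor, the signals including computer instructions stored in the memory. When the processor executes the computer instructions, the electronic device performs the navigation map construction method of the third aspect or any possible implementation of the third aspect.
The ninth aspect and any implementation of the ninth aspect correspond to the third aspect and any implementation of the third aspect, respectively. For their technical effects, see the technical effects of the third aspect and its implementations above; they are not repeated here.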
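In a tenth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program which, when run on a computer or processor, causes the computer or processor to perform the navigation method of the first aspect or any possible implementation of the first aspect.
The tenth aspect and any implementation of the tenth aspect correspond to the first aspect and any implementation of the first aspect, respectively. For their technical effects, see the technical effects of the first aspect and its implementations above; they are not repeated here.
In an eleventh aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program which, when run on a computer or processor, causes the computer or processor to perform the visual positioning method of the second aspect or any possible implementation of the second aspect.
The eleventh aspect and any implementation of the eleventh aspect correspond to the second aspect and any implementation of the second aspect, respectively. For their technical effects, see the technical effects of the second aspect and its implementations above; they are not repeated here.
In a twelfth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program which, when run on a computer or processor, causes the computer or processor to perform the navigation map construction method of the third aspect or any possible implementation of the third aspect.
The twelfth aspect and any implementation of the twelfth aspect correspond to the third aspect and any implementation of the third aspect, respectively. For their technical effects, see the technical effects of the third aspect and its implementations above; they are not repeated here.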
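In a thirteenth aspect, an embodiment of this application provides a computer program product including a software program which, when executed by a computer or processor, causes the computer or processor to perform the navigation method of the first aspect or any possible implementation of the first aspect.
The thirteenth aspect and any implementation of the thirteenth aspect correspond to the first aspect and any implementation of the first aspect, respectively. For their technical effects, see the technical effects of the first aspect and its implementations above; they are not repeated here.
In a fourteenth aspect, an embodiment of this application provides a computer program product including a software program which, when executed by a computer or processor, causes the computer or processor to perform the visual positioning method of the second aspect or any possible implementation of the second aspect.
The fourteenth aspect and any implementation of the fourteenth aspect correspond to the second aspect and any implementation of the second aspect, respectively. For their technical effects, see the technical effects of the second aspect and its implementations above; they are not repeated here.
In a fifteenth aspect, an embodiment of this application provides a computer program product including a software program which, when executed by a computer or processor, causes the computer or processor to perform the navigation map construction method of the third aspect or any possible implementation of the third aspect.
The fifteenth aspect and any implementation of the fifteenth aspect correspond to the third aspect and any implementation of the third aspect, respectively. For their technical effects, see the technical effects of the third aspect and its implementations above; they are not repeated here.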
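Figure 1 is a schematic diagram of an exemplary application scenario;
Figure 2 is a schematic diagram of an exemplary navigation process;
Figure 3a is a schematic diagram of an exemplary interface;
Figure 3b is a schematic diagram of an exemplary interface;
Figure 4 is a schematic diagram of an exemplary navigation process;
Figure 5 is a schematic diagram of an exemplary visual positioning process;
Figure 6 is a schematic diagram of an exemplary visual positioning process;
Figure 7 is a schematic diagram of an exemplary visual positioning process;
Figure 8 is a schematic diagram of an exemplary navigation map construction process;
Figure 9 is a schematic diagram of an exemplary navigation map construction process;
Figure 10 is a schematic diagram of an exemplary navigation map construction process;
Figure 11 is a schematic structural diagram of an exemplary apparatus.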
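The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this application without creative effort fall within the scope of protection of this application.
The term "and/or" in this document merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone.
The terms "first", "second", and so on in the description and claims of the embodiments of this application are used to distinguish different objects, not to describe a particular order of objects. For example, a first target object and a second target object are different target objects; the terms do not describe a particular order of the target objects.
In the embodiments of this application, words such as "exemplary", "illustratively", and "for example" indicate an example, illustration, or explanation. Any embodiment or design described with these words should not be construed as preferred over, or more advantageous than, other embodiments or designs. Rather, these words are intended to present a concept in a concrete way.
In the description of the embodiments of this application, unless otherwise stated, "a plurality of" means two or more. For example, a plurality of processing units means two or more processing units, and a plurality of systems means two or more systems.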
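Figure 1 is a schematic diagram of an exemplary application scenario. The embodiment of Figure 1 shows a navigation scenario.
Illustratively, a user who is unfamiliar with the route to a destination can navigate with the AR map (which may be an application or a mini program) on a device such as a mobile phone. The user can launch the AR map and enter the navigation search interface 101, as shown in Figure 1(1). The navigation search interface 101 may include one or more controls, including but not limited to an edit box 102 and a modality selection button 103; this application places no limitation on this.
Illustratively, after the user enters location search information in the edit box 102, the AR map can respond to the user operation by displaying a location search result list 108, as shown in Figure 1(3). For example, after the user types "dormitory" in the edit box 102, the location search result list 108 shows several location search results, such as "B1 Dormitory Building", "B2 Dormitory Building", and "B3 Dormitory Building", as shown in Figure 1(3).
Illustratively, when the user wants to go to "B1 Dormitory Building", the user can tap the "Navigate" control of the "B1 Dormitory Building" entry in the location search result list 108. The AR map then responds to the user operation by turning on the camera and entering the AR visual navigation interface 109, as shown in Figure 1(4). The AR visual navigation interface 109 may include a navigation prompt information option 110, a navigation guidance marker 111, and the image captured by the camera (including the trees, buildings, and road in the AR visual navigation interface 109). Holding the device and following the navigation prompt information option 110 and the navigation guidance marker 111 in the AR visual navigation interface 109, the user can reach "B1 Dormitory Building". Note that as the user moves, the AR map can keep updating the navigation prompt information in the navigation prompt information option 110, and the navigation guidance marker 111, according to the images captured by the camera.
Illustratively, besides text input as described above, this application also lets the user enter location search information in other modalities (such as speech and images), which diversifies the input options for navigation search and can improve the user's navigation experience.
Illustratively, when the user wants to enter location search information in a modality other than text, the user can tap the modality selection button 103 in Figure 1(1); the AR map then responds to the user operation by displaying the modality selection window 104, as shown in Figure 1(2). The modality selection window 104 may include several modality input options, including but not limited to a voice input option 105, an image input option 106, and a photo option 107. When the user wants to enter location search information by speech, the user can tap the voice input option 105; the AR map responds by starting the recording module, and the user can then speak. When the user wants to use an existing image as location search information, the user can tap the image input option 106; the AR map responds by opening the photo album, from which the user can select the image to enter. When the user wants to use a newly taken image as location search information, the user can tap the photo option 107; the AR map responds by turning on the camera and entering the photo interface, where the user can take a photo to obtain the image to enter.
Correspondingly, this application proposes a navigation method that searches according to the multimodal location search information entered by the user and provides the user with AR visual navigation.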
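Figure 2 is a schematic diagram of an exemplary navigation process.
S201 Obtain the location search information entered by the user in the AR map, the location search information including at least one of: text, speech, or an image.
Illustratively, when the user needs navigation, the user can launch the AR map on the device and enter the navigation search interface 101 of Figure 1(1). The user can type location search information as text into the edit box of the navigation search interface 101; tap the modality selection button 103 in Figure 1(1) and then the voice input option 105 in the modality selection window 104 of Figure 1(2), to enter location search information by speech; or tap the image input option 106 or the photo option 107 in the modality selection window 104 of Figure 1(2), to enter location search information as an image.
Note that the user can enter location search information in the navigation search interface 101 in any two, or all three, of text, speech, and image form at the same time; this application places no limitation on this.
Illustratively, location search information is information used to search for a location. It may include, but is not limited to, entity name text, entity name speech, an entity image, and address information; this application places no limitation on this. An entity is something that can exist independently and serves as the basis of all attributes and the origin of all things, for example things that objectively exist and can be distinguished from one another. Entities may include places (such as shops, zoos, and botanical gardens) and the objects contained in places (such as goods in a shop, animals in a zoo, and plants in a botanical garden).
Illustratively, entity name text or speech may include the text or speech of a place name, such as "** Shop", "** Zoo", or "** Botanical Garden". Entity name text or speech may also include the text or speech of the name of an object contained in a place, such as "** brand canvas shoes", "lion", or "bauhinia".
Illustratively, an entity image may be an image of a place, such as an image of a shop sign, a zoo gate, or a botanical garden gate. An entity image may also be an image of an object contained in a place, such as an image of ** brand canvas shoes, a lion, or a bauhinia.
Illustratively, once the user has finished entering location search information in the navigation search interface 101, the AR map obtains the entered location search information, and S202 can then be performed.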
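S202 Based on the location search information, perform a multimodal search over preset multimodal information to determine the location search result that matches the location search information.
Illustratively, entity identification information for a plurality of entities may be collected in advance and associated with the corresponding entity addresses. For each entity, its entity identification text may be collected, its entity identification image may be collected, or both; this application places no limitation on this. Entity identification text is text that can be used to identify the entity, and an entity identification image is an image that contains entity identification text or graphics.
Illustratively, entity identification information may include place identification information and object identification information. Place identification information may include place identification text and/or a place identification image, and object identification information may include object identification text and/or an object identification image.
Illustratively, the place identification information of any place may also be associated with the object identification information of the objects the place contains, and both may be associated with the address of the place.
For example, the place is a large supermarket and the objects it contains are the goods it sells. The object identification information (such as product names and product images) can be associated with the place identification information (such as the supermarket name and an image of the supermarket sign), and both can be associated with the address of the supermarket.
For example, the place is a large zoo and the objects it contains are the animals in the zoo. The object identification information (such as animal names and animal images) can be associated with the place identification information (such as the zoo name and an image of the zoo sign), and both can be associated with the address of the zoo.
In this way, whether the user enters place name text or speech or a place image, or object name text or speech or an object image, as location search information, the AR map can quickly provide accurate navigation.
Illustratively, the collected entity identification information of the plurality of entities (entity identification text and/or entity identification images) forms the multimodal information.
Illustratively, after the location search information is obtained, a multimodal search can be performed over the preset multimodal information to find the location search result that matches the location search information.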
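Illustratively, multimodal search may include same-modal search and cross-modal search.
Illustratively, same-modal search may proceed as follows. When the location search information is text, the search runs over the entity identification text in the preset multimodal information, to find the entity identification text that matches the location search information as the location search result. When the location search information is speech, speech recognition is performed first to obtain recognized text; the search then runs over the entity identification text in the preset multimodal information, to find the entity identification text that matches the recognized text as the location search result. When the location search information is an image, the search runs over the entity identification images in the preset multimodal information, to find the entity identification image that matches the location search information as the location search result.
Illustratively, cross-modal search may proceed as follows. When the location search information is text or an image, the search runs over both the entity identification text and the entity identification images in the preset multimodal information, to find the entity identification text and/or images that match the location search information as the location search result. When the location search information is speech, speech recognition is performed first to obtain recognized text; the search then runs over the entity identification text and entity identification images in the preset multimodal information, to find the matching entity identification text and/or images as the location search result.
Illustratively, there may be one or more location search results. For example, the location search results can be shown in the location search result list 108 of Figure 1(3).
S203 Perform AR visual navigation according to the location search result.
Illustratively, when there are several location search results, the user can select one as needed, and AR visual navigation is performed according to the selected result. When there is only one location search result, AR visual navigation can be performed directly according to the result determined in S202.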
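Figures 3a and 3b are schematic diagrams of exemplary interfaces.
Illustratively, when the user wants to buy black earphones, the user can type the location search information "black earphones" into the edit box of the navigation search interface 101 of Figure 1(1), as shown in Figure 3a(1). A multimodal search over the preset multimodal information based on "black earphones" then finds two earphone images, image 1 and image 2, as the matching location search results, and shows them in the location search result list 108 of Figure 1(3), as shown in Figure 3a(2). If the user selects image 1, AR visual navigation guides the user to the shop that sells the earphones in image 1.
Illustratively, when the user's phone holds an image of a pair of shoes whose brand the user does not know, the user can tap the image input option 106 in the modality selection window 104 of Figure 1(2) and then pick the image of the shoes from the album that is shown, as in Figure 3b(1). A multimodal search over the preset multimodal information based on the image of the shoes then finds three shop images, image 3, image 4, and image 5, as the matching location search results, and shows them in the location search result list 108 of Figure 1(3), as shown in Figure 3b(2). If the user selects image 5, AR visual navigation guides the user to the shop corresponding to image 5.
Compared with existing techniques that accept only text input for navigation, this application thus lets users enter location search information in several modalities, which diversifies the input options for navigation search and can improve the user's navigation experience.
Illustratively, the location search information may be entity name text, entity name speech, or an entity image, where entities include places and the objects that places contain. The AR map can therefore provide AR visual navigation not only when the user enters place name text, place name speech, or a place image, but also when the user enters an object image, object name text, or object name speech. The user can thus be guided quickly to the location of the desired entity.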
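Figure 4 is a schematic diagram of an exemplary navigation process. The embodiment of Figure 4 describes in detail the cross-modal search process and the AR visual navigation process.
S401 Obtain the location search information entered by the user in the augmented reality (AR) map, the location search information including at least one of: text, speech, or an image.
Illustratively, for S401, refer to the description of S201 above; it is not repeated here.
Illustratively, the cross-modal search over the preset multimodal information based on the location search information, which determines the matching location search result, may follow S402 to S405 below:
S402 Extract features from the location search information to obtain the first feature corresponding to the location search information.
S403 Extract features from the plurality of pieces of entity identification information to obtain the second feature corresponding to each piece.
S404 Determine the first feature distance between the second feature of each piece of entity identification information and the first feature.
S405 From the plurality of pieces of entity identification information, select those whose first feature distance is less than the first distance threshold as the location search result.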
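Illustratively, a cross-modal search model can be trained in advance, and the trained model is then used to perform cross-modal search.
Illustratively, entity name text and entity images can be collected for a plurality of entities, and the entity name text and entity image of the same entity are used as one group of training data, giving multiple groups of training data. Training the cross-modal search model with one group of training data is described below as an example.
Illustratively, the entity name text and entity image of one group of training data are fed into the cross-modal search model. On the one hand, the model extracts features from the entity name text to obtain the corresponding text feature; on the other hand, it extracts features from the entity image to obtain the corresponding image feature. The model then computes the distance between the text feature and the image feature. Next, with the goal of minimizing this distance, backpropagation is performed on the model to adjust its parameters. The model is trained in this way with multiple groups of training data, so that it learns to bring the image feature and the text feature of the same entity into the same feature space. The trained cross-modal search model can therefore search images by text and text by images.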
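The training goal described above can be written compactly as follows. The text states only that the distance between the text feature and the image feature of the same entity is minimized; summing over all training groups, the symbols, and the choice of distance are assumptions made here for illustration.

```latex
\min_{\theta}\; \sum_{i=1}^{n} d\bigl(f_{\theta}^{\mathrm{text}}(t_i),\; f_{\theta}^{\mathrm{img}}(x_i)\bigr)
```

Here $(t_i, x_i)$ is the entity name text and entity image of the $i$-th training group, $f_{\theta}^{\mathrm{text}}$ and $f_{\theta}^{\mathrm{img}}$ are the text and image branches of the cross-modal search model with parameters $\theta$, and $d$ is a feature distance such as the Euclidean distance. Objectives of this kind are commonly paired with a term that pushes apart features of different entities so that the features do not collapse to a single point; the text does not mention such a term.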
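Illustratively, one piece of entity identification information and the location search information can be fed into the trained cross-modal search model at a time. On the one hand, the model extracts features from the location search information to obtain the first feature; on the other hand, it extracts features from the entity identification information to obtain the second feature. The model then computes and outputs the first feature distance between the first feature and the second feature. In this way, the first feature distance between the first feature and the second feature of each piece of entity identification information is obtained.
It should be understood that when the location search information is speech, speech recognition is first performed on it to obtain recognized text, and the recognized text is then fed into the trained cross-modal search model.
Illustratively, the pieces of entity identification information whose second feature lies at a first feature distance less than the first distance threshold from the first feature are identified and taken as the location search results that match the location search information.
Illustratively, in the multimodal information, some entity identification information includes only entity identification text, some includes only an entity identification image, and some includes both. For ease of description, entity identification information that includes both entity identification text and an entity identification image is called first entity identification information.
Illustratively, for first entity identification information, feeding its entity identification text and the location search information into the cross-modal search model yields the first feature distance between the first feature and the second feature of the entity identification text. Feeding its entity identification image and the location search information into the model yields the first feature distance between the first feature and the second feature of the entity identification image. A weighted combination of these two distances is then computed, and the result is taken as the first feature distance between the first feature and the second feature corresponding to the first entity identification information.
Illustratively, the smaller the first feature distance, the more similar the second feature is to the first feature, that is, the more similar the corresponding entity identification information is to the location search information. Entity identification information whose first feature distance is less than the first distance threshold can therefore be selected as the location search result, so that the results are the entity identification information most similar to the location search information. The first distance threshold can be set as needed; this application places no limitation on this.
In this way, cross-modal search allows "fuzzy" search that covers the entity identification information as completely as possible and avoids omissions.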
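The distance computation and thresholding of S404 and S405 can be sketched as follows. This is a minimal illustration that assumes the features have already been produced by some cross-modal model; the Euclidean distance, the equal weights, the dictionary layout, and all names and values are assumptions, not details given in the text.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def entity_distance(query_feat, entity, w_text=0.5, w_image=0.5):
    """First feature distance for one entity.

    If the entity has both a text feature and an image feature, the two
    distances are combined with (assumed) weights; otherwise the single
    available distance is used as is.
    """
    d_text = euclidean(query_feat, entity["text_feat"]) if "text_feat" in entity else None
    d_image = euclidean(query_feat, entity["image_feat"]) if "image_feat" in entity else None
    if d_text is not None and d_image is not None:
        return w_text * d_text + w_image * d_image
    return d_text if d_text is not None else d_image

def search(query_feat, entities, threshold):
    """Return the names of entities whose distance is below the threshold, nearest first."""
    scored = sorted((entity_distance(query_feat, e), e["name"]) for e in entities)
    return [name for dist, name in scored if dist < threshold]

# Made-up 2D features standing in for the model outputs.
entities = [
    {"name": "A", "text_feat": (1.0, 0.0), "image_feat": (1.0, 0.2)},  # text and image
    {"name": "B", "text_feat": (0.0, 1.0)},                            # text only
    {"name": "C", "image_feat": (0.7, 0.0)},                           # image only
]
results = search((1.0, 0.0), entities, threshold=0.5)
```

Because the text and image features live in one feature space, the same query feature can be compared against either kind of entity feature, which is what lets a text query return an image-only entity such as "C" here.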
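Illustratively, AR visual navigation according to the location search result may follow steps S406 to S408 below:
S406 Based on the pre-generated 2D visual navigation map, generate a 2D navigation path between the user's current position and the target position corresponding to the location search result.
Illustratively, the target position corresponding to the location search result is obtained, and the 2D navigation path between the user's current position and the target position is then generated from the pre-built 2D visual navigation map. How the 2D visual navigation map is generated is described later.
Illustratively, when the location search result is place identification information, the location of the corresponding place can be taken as the target position (for example, if the location search information is "** Supermarket", the location of "** Supermarket" is the target position). When the location search result is object identification information, the location of the place to which the corresponding object belongs can be taken as the target position (for example, if the location search information is "** potato chips", the location of the supermarket that sells "** potato chips" is the target position). Alternatively, when the location search result is object identification information, the location of the category to which the corresponding object belongs can be taken as the target position (for example, if the location search information is "** potato chips", the location of the "snack foods" section in the supermarket is the target position).
Illustratively, the user's current position can be obtained by the positioning module in the device.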
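S407 Perform visual positioning to determine the target pose, the target pose being the current pose of the device.
Illustratively, after the 2D navigation path between the user's current position and the target position is determined, visual positioning can be performed to determine the current pose of the device. The current pose of the device represents the current pose of the user, so that AR visual navigation can later be performed according to the user's current pose and the 2D navigation path.
Illustratively, the camera can be turned on and called to capture images, and visual positioning is then performed on the captured images to determine the target pose. For ease of description, an image captured by the camera during visual positioning is called a first image. When the camera is turned on, the user can also be reminded to point the device's camera forward, so that the target pose determined from the first image is closer to the user's current pose.
Illustratively, the period at which the camera captures first images may be preset by the device system (for example, a period of 0.5 ms, that is, the camera captures one frame every 0.5 ms); this application places no limitation on this.
Illustratively, visual positioning can be performed at a preset visual positioning period, which can be set as needed, for example 10 s or 15 s; this application places no limitation on this.
Illustratively, each time the preset period elapses, the first image captured by the camera closest to the current time is obtained, and visual positioning is performed on it to determine the pose of the device when it captured that first image (hereinafter, the target pose). The visual positioning process is described later.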
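S408 Perform AR visual navigation based on the 2D navigation path and the target pose.
Illustratively, the first image captured by the camera can be displayed in the AR visual navigation interface 109 of Figure 1(4), and the navigation prompt information in the navigation prompt information option 110 can be updated according to the 2D navigation path and the target pose.
Illustratively, the first image in the AR visual navigation interface 109 of Figure 1(4) is refreshed at the same period at which the camera captures first images. In addition, the navigation prompt information in the navigation prompt information option 110 can be updated once after each visual positioning; that is, its update period equals the visual positioning period.
Note that the navigation prompt information shown in the navigation prompt information option 110 when the AR visual navigation interface 109 is first displayed can be generated from the 2D navigation path produced in S406 and the user's current position obtained by the positioning module. After the first visual positioning completes, the navigation prompt information in the navigation prompt information option 110 can be updated according to the 2D navigation path and the target pose determined by that visual positioning.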
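The visual positioning process is described below by way of example.
This application proposes a visual positioning method that combines multimodal information to raise the visual positioning success rate in scenes with illumination changes, seasonal changes, viewpoint or scale changes, repetitive textures, or weak textures.
Figure 5 is a schematic diagram of an exemplary visual positioning process. The embodiment of Figure 5 describes the visual positioning process.
Because visual positioning relies on the visual positioning map, the visual positioning map is introduced first.
Illustratively, data collection devices (such as image capture devices and position capture devices) can be moved around the places (outside them) to collect images and other data about each place. Three-dimensional reconstruction is then performed on the collected data to obtain the visual positioning map (a 3D map).
Illustratively, while the visual positioning map is being built, 2D feature points can be extracted from the images captured by the data collection device (hereinafter, second images), and the 3D point cloud (which may include 3D coordinates) corresponding to the 2D feature points in the visual positioning map can be determined; the GPS (Global Positioning System) information of each second image is also recorded. The second images, the 2D feature points they contain, the descriptors of the 2D feature points (vectors that describe a feature point's texture, color, and other information), the 3D point cloud corresponding to the 2D feature points, the GPS information of the second images, and an index table of the relationship between second images and GPS information are then stored in the visual positioning map database.
Illustratively and optionally, second text can also be extracted from each second image, and the global features of each second image can be extracted to obtain its second global feature vector. The second text and the second global feature vector of each second image can then be stored in the visual positioning map database.
Illustratively, once the visual positioning map is available, mapping the 3D point cloud in the visual positioning map database to the 2D plane yields the 2D visual navigation map.
Illustratively, the visual positioning process is described with reference to steps S501 to S504 below.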
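S501 Capture a first image.
Illustratively, during visual positioning, the camera can be turned on and called to capture the first image.
S502 Extract the first text from the first image and extract the first global feature vector of the first image.
Illustratively, image retrieval can be used to retrieve from the visual positioning map database a second image that matches the first image (here, a second image that matches the first image may be one whose shooting angle is similar to, and whose shooting distance is close to, that of the first image). The target pose of the device when it captured the first image is then determined from the first image and the retrieved second image.
Illustratively, image retrieval can use both the high-level semantic information and the global features of the first and second images. Specifically, the second images are filtered a first time according to the high-level semantic information and a second time according to the global features. This can improve both the accuracy and the speed of image retrieval.
Illustratively, on the one hand, high-level semantic information (such as a parking space number or the text on a shop sign) can be extracted from the first image to obtain the first text of the first image. On the other hand, global feature extraction can be performed on the first image to obtain its first global feature vector.
Illustratively, a trained feature extraction network can be used to extract the first global feature vector of the first image. In some of the network's layers, each layer includes weights corresponding to the individual regions of the image. The target region that the first text occupies in the first image can be determined, and the weights corresponding to the target region in a network layer of the trained feature extraction network (for example, the last layer) are then increased. The first image is then fed into the feature extraction network, which extracts the global features of the first image and outputs the first global feature vector.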
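One way to picture the region re-weighting is as a spatially weighted pooling step at the end of the network. The sketch below is only an analogy under stated assumptions: it treats the last layer as a weighted average over a grid of per-cell feature vectors and raises the weight of the cells inside the text box. The function name, the `boost` factor, and the grid layout are hypothetical; the text does not specify how the weights are stored or by how much they are increased.

```python
def weighted_global_feature(feature_map, text_box, boost=2.0):
    """Pool a grid of per-cell features into one global vector.

    feature_map[r][c] is a list of channel values for grid cell (r, c).
    text_box = (r0, c0, r1, c1) marks the cells covered by the first text
    (end-exclusive). Cells inside the box get weight `boost`, others 1.0.
    """
    r0, c0, r1, c1 = text_box
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    weight_sum = 0.0
    for r, row in enumerate(feature_map):
        for c, cell in enumerate(row):
            w = boost if (r0 <= r < r1 and c0 <= c < c1) else 1.0
            weight_sum += w
            for k in range(channels):
                total[k] += w * cell[k]
    return [t / weight_sum for t in total]

# A 2x2 grid with one channel; only the top-left cell (the text region) responds.
fmap = [[[1.0], [0.0]],
        [[0.0], [0.0]]]

plain = weighted_global_feature(fmap, (0, 0, 1, 1), boost=1.0)    # uniform weights
boosted = weighted_global_feature(fmap, (0, 0, 1, 1), boost=3.0)  # text region emphasised
```

Raising the weight of the text region doubles its contribution to the pooled vector in this toy case (0.25 to 0.5), which illustrates why the resulting global feature vector reflects the text area more strongly.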
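S503: perform image retrieval according to the first text, the first global feature vector, the second text in a plurality of preset second images and the second global feature vectors of those second images, so as to select from the plurality of second images a third image that matches the first image, the plurality of second images having been collected while the visual localization map was built.
Illustratively, when the visual localization map database includes the second text and the second global feature vectors of the second images, the second images may first be filtered according to the first text and the second text, and then filtered a second time according to the second global feature vectors and the first global feature vector, to obtain the second images that match the first image. For ease of description, a second image retrieved as matching the first image is called a third image.
Illustratively, a plurality of fourth images that contain the first text may be selected from the second images according to their second text. For example, if the first text in the first image is the parking-space number "0372", the second images whose second text includes "0372" are taken as fourth images. Next, the second feature distance between the second global feature vector of each fourth image and the first global feature vector may be calculated; the second feature distance may be, for example, the Euclidean distance. The third images, whose second feature distance is smaller than a second distance threshold, are then selected from the fourth images. Because a smaller second feature distance means that the first image and the second image were shot from more similar angles and at closer distances, this selects the second images whose shooting angle is highly similar to, and whose shooting distance is close to, that of the first image as the matching second images (that is, the third images). There may be one third image or several; this application does not limit this. The second distance threshold may be set as required; this application does not limit this.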
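The two-stage filtering described above can be sketched as follows. The record layout (dictionaries with `id`, `text` and `vec` keys) and the function name `retrieve_third_images` are assumptions made for illustration; only the text filter followed by a Euclidean-distance threshold comes from the description.

```python
import math


def retrieve_third_images(first_text, first_vec, second_images, max_distance):
    """Return ids of second images that match the query, closest first.

    second_images: list of dicts with keys 'id', 'text' (set of str), 'vec' (list of float).
    """
    # First filter: keep the images whose second text contains the first text.
    fourth_images = [img for img in second_images if first_text in img["text"]]
    # Second filter: keep the images whose global feature is close to the query's.
    scored = sorted((math.dist(first_vec, img["vec"]), img["id"]) for img in fourth_images)
    return [image_id for distance, image_id in scored if distance < max_distance]
```

Filtering on text first keeps the more expensive vector comparison to a small candidate set, which matches the stated aim of improving both retrieval accuracy and speed.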
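Illustratively, when the visual localization map database does not include the second text and the second global feature vectors of the second images, the second text of each second image and the second global feature vector of each second image may be extracted.
It should be understood that second text is extracted from a second image in a way similar to how first text is extracted from the first image, and that the second global feature vector is extracted in a way similar to the first global feature vector; details are not repeated here.
S504: determine the target pose based on the first image and the third image, the target pose being the pose of the device when the first image was captured.
Illustratively, feature matching and pose estimation may be performed on the first image and the third image to determine the pose of the device when the first image was captured.
Illustratively, feature matching may proceed as follows:
1) Extract 2D feature points from the first image. Illustratively, feature point detection may be performed on the first image to extract its 2D feature points.
2) Generate descriptors for the 2D feature points in the first image. Illustratively, for each 2D feature point in the first image, a vector describing its texture, colour and other information, that is, a descriptor, may be generated. Illustratively, descriptors of the same feature point in different images are close to each other, whereas descriptors of different feature points are far apart.
3) Match feature points between the first image and the third image. Illustratively, the descriptor distance between the descriptors of the 2D feature points in the first image and the descriptors of the 2D feature points in the third image may be calculated; the 2D feature points in the third image whose descriptor distance to the descriptor of a 2D feature point in the first image is smaller than a third distance threshold are taken as candidate feature points.
4) Inlier filtering. Geometric constraints may be applied to the candidate feature points to filter out wrongly matched candidates, yielding the target feature points.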
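Step 3) above can be sketched as a brute-force comparison of descriptors. This is a simplified illustration, assuming descriptors are plain lists of floats compared by Euclidean distance; the function name `candidate_matches` is hypothetical, and the geometric inlier filtering of step 4) is not shown.

```python
import math


def candidate_matches(first_descriptors, third_descriptors, max_distance):
    """Return (first_index, third_index) pairs whose descriptor distance is below max_distance."""
    matches = []
    for i, first in enumerate(first_descriptors):
        for j, third in enumerate(third_descriptors):
            if math.dist(first, third) < max_distance:
                matches.append((i, j))
    return matches
```

The pairs returned here are only candidates; some may still be wrong matches, which is why the inlier filtering step follows.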
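Illustratively, pose estimation proceeds as follows: the 3D point cloud corresponding to the target feature points may be determined, and the pose may then be solved from the positions of the target feature points in the third image and the 3D coordinates in the 3D point cloud, to obtain the target pose.
Text with high-level semantics distinguishes well between different images in scenes with illumination changes, seasonal changes, viewpoint and scale changes, repeated textures or weak textures. Compared with the prior art, which performs visual localization using only the single modality of the image, this application uses information of multiple modalities (the image and the high-level semantic text in it) and can therefore effectively improve the visual localization success rate in these scenes.
It should be understood that the visual localization method described in the embodiment of Figure 5 may be applied to the embodiment of Figure 4. By improving the visual localization success rate in scenes with illumination changes, seasonal changes, viewpoint and scale changes, repeated textures or weak textures, it improves the accuracy of AR visual navigation and thus the user's navigation experience.
Illustratively, S409 in Figure 4 is performed in the SLAM (Simultaneous Localization and Mapping) coordinate system (and may be performed by the SLAM system), whereas the target pose is calculated in the visual localization coordinate system. The target pose may therefore be converted from the visual localization coordinate system to the SLAM coordinate system, so that the SLAM system can subsequently perform AR visual navigation based on the first image, the 2D navigation path and the target pose.
It should be noted that the visual localization method of Figure 5 may also be applied to other scenarios, for example holographic information display, photographing, intelligent explanation, smart IoT (Internet of Things) scenarios and AR game interaction; this application does not limit this.
Illustratively, in each visual localization, multiple frames of first images may be captured and used jointly to further improve the success rate of visual localization.
Figure 6 is a schematic diagram of an exemplary visual localization process. In the embodiment of Figure 6, the device may capture multiple frames of first images while the user rotates the device, and visual localization is then performed on the frames captured during the rotation.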
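S601: capture K1 first images while the device is rotating.
Illustratively, at each visual localization the user may be prompted to rotate the device, for example by displaying a prompt such as "Please rotate the camera" in the AR visual navigation interface 109 of Figure 1(4), or by a voice prompt; this application does not limit this. K1 first images are then captured while the user rotates the device, where K1 is an integer greater than 1.
S602: extract the first text in the K1 first images and extract the first global feature vectors of the K1 first images.
Illustratively, with reference to S502 above, the first text of each of the K1 first images and the first global feature vector of each first image may be extracted.
S603: perform image retrieval according to the first text in the K1 first images, the first global feature vectors of the K1 first images, the second text in a plurality of preset second images and the second global feature vectors of those second images, so as to select from the second images the M third images matching each of the K1 first images.
Illustratively, with reference to S503 above, the third images matching each first image may be selected from the second images. Illustratively, each first image has M matching third images (M is a positive integer); the number of third images matching each first image may be the same or different; this application does not limit this.
S604: based on the M third images matching each of the K1 first images, determine the N candidate poses corresponding to each of the K1 first images and the single-frame confidence of each of the N candidate poses; N is a positive integer.
Illustratively, each of the K1 first images may in turn be taken as the target image; the N candidate poses of the target image may then be determined from the target image and its M matching third images.
In one possible approach, the M third images matching the target image are first clustered by covisibility into N groups of third images; then, in the manner of S504, one candidate pose of the target image is determined from the target image and one group of third images, so that N candidate poses of the target image are determined.
Illustratively, covisibility clustering may refer to clustering performed with the goal that any two groups of third images share part of the 3D point cloud, the number of shared 3D points being greater than or equal to 1.
In another possible approach, M candidate poses of the target image are first determined, in the manner of S504, from the target image and its M matching third images; the M candidate poses are then clustered to obtain the N candidate poses of the target image.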
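The second approach (clustering M candidate poses into N) does not name a clustering algorithm. As one possible illustration, the sketch below uses a greedy distance-based clustering on the position part of each pose only and returns one centroid per cluster; the function name `cluster_positions`, the `radius` parameter and the choice to ignore orientation are all assumptions.

```python
import math


def cluster_positions(positions, radius):
    """Greedily group 3D positions and return one centroid per group.

    A position joins the first cluster whose seed (first member) lies within
    `radius`; otherwise it starts a new cluster.
    """
    clusters = []
    for position in positions:
        for cluster in clusters:
            if math.dist(position, cluster[0]) <= radius:
                cluster.append(position)
                break
        else:
            clusters.append([position])
    return [
        tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
        for cluster in clusters
    ]
```

With M input positions the function returns N centroids, where N depends on `radius`; a full implementation would also compare orientations.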
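Illustratively, after the N candidate poses of the target image are determined, the single-frame confidence of each of the N candidate poses may be determined.
Illustratively, the single-frame confidence of a candidate pose may be calculated with the following formula:
single-frame confidence = sigmoid(Y), Y = x/10
where x is the number of inliers corresponding to the pose (the number of feature points matched between the third image and the first image after inlier filtering), and sigmoid(Y) = 1/(1 + exp(-Y)).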
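The formula above translates directly into code. In the sketch below only the function name `single_frame_confidence` is an assumption; the computation follows the formula as given.

```python
import math


def single_frame_confidence(inlier_count):
    """sigmoid(x / 10), where x is the number of inliers of the candidate pose."""
    y = inlier_count / 10
    return 1.0 / (1.0 + math.exp(-y))
```

Since x is a non-negative count, the confidence starts at 0.5 for zero inliers and approaches 1 as the number of inliers grows.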
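In this way, the N candidate poses of each of the K1 first images, and the single-frame confidence of each of those candidate poses, can be determined.
S605: determine the target pose according to the N candidate poses corresponding to each of the K1 first images and the single-frame confidences of those candidate poses.
Illustratively, the N candidate poses of each of the K1 first images may be traversed, and one candidate pose may be selected from the N candidate poses of each of any two first images to form a pose combination, yielding a plurality of pose combinations. For example, the j1-th candidate pose may be selected from the N candidate poses of the i1-th first image and the j2-th candidate pose from the N candidate poses of the i2-th first image; the j1-th and j2-th candidate poses then form a pose combination. Here i1 and i2 are positive integers between 1 and K1 with i1 not equal to i2, and j1 and j2 are positive integers between 1 and N with j1 not equal to j2. Traversing the N candidate poses of each of the K1 first images in this way yields a plurality of pose combinations.
Illustratively, after the pose combinations are obtained, for a first pose combination among them, the relative pose between the two candidate poses it contains may be calculated; this is the predicted relative pose between the two images. After the K1 first images are captured, the SLAM system can output the relative pose between any two of them, that is, the true relative pose between the two images, called the SLAM pose. The SLAM pose between the two first images corresponding to the first pose combination may therefore be obtained, and the pose error between the relative pose of the two candidate poses in the first pose combination and that SLAM pose may then be calculated, that is, the error between the predicted and the true relative pose.
Illustratively, if there are candidate pose combinations whose pose error is smaller than a preset error, the predicted relative pose is fairly accurate and those combinations are reliable. In that case the joint confidence of each such combination may be determined from the single-frame confidences of the combination, and the combination with the highest joint confidence is determined as the target pose. The preset error may be set as required; this application does not limit this.
Illustratively, a joint-confidence weight of the candidate pose combination may first be determined, and the joint confidence may then be determined from the single-frame confidences of the two candidate poses in the combination and the joint-confidence weight. For example, the single-frame confidences of the two candidate poses in the combination are multiplied by the joint-confidence weight to obtain the joint confidence of the combination.
Illustratively, the joint-confidence weight may be determined from the two candidate poses in the combination and a normal distribution function. For example, the relative pose between the two candidate poses of the combination may be calculated and used as the independent variable of the normal distribution function; the resulting probability value is the joint-confidence weight.
Illustratively, the mean and variance of the normal distribution function may be set as required, for example a mean of 0 and a variance of 1; this application does not limit this.
In this way, visual localization over a large field of view can be performed by combining multiple frames, which improves the success rate of visual localization, especially in scenes with illumination changes, seasonal changes, viewpoint and scale changes, repeated textures or weak textures.
Illustratively, if no candidate pose combination has a pose error smaller than the preset error, the predicted relative poses are inaccurate and the combinations are unreliable; in that case the pose with the highest single-frame confidence among the N candidate poses of each of the K1 first images may be determined as the target pose.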
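The selection logic of S605, including the fallback, can be sketched as follows. Several simplifications are assumptions rather than part of this application: poses are reduced to 2D positions, the relative pose is a plain displacement, the pose error is the Euclidean norm of the difference from the SLAM displacement, and the joint-confidence weight is the normal density (mean 0, variance 1) evaluated at that error. The names `normal_pdf` and `select_target` are hypothetical.

```python
import math
from itertools import combinations, product


def normal_pdf(x, mean=0.0, variance=1.0):
    """Density of the normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def select_target(candidates, slam_relative, max_error):
    """Pick the most trustworthy candidate pose(s).

    candidates[i] lists (position, confidence) pairs for frame i, position = (x, y).
    slam_relative[(i1, i2)] is the SLAM displacement from frame i1 to frame i2 (i1 < i2).
    Returns the two positions of the best consistent pair, or a one-element list
    holding the single most confident position when no pair is consistent.
    """
    best_pair, best_joint = None, -1.0
    for i1, i2 in combinations(range(len(candidates)), 2):
        sx, sy = slam_relative[(i1, i2)]
        for (p1, c1), (p2, c2) in product(candidates[i1], candidates[i2]):
            # Error between the predicted relative pose and the SLAM relative pose.
            error = math.hypot(p2[0] - p1[0] - sx, p2[1] - p1[1] - sy)
            if error < max_error:
                joint = c1 * c2 * normal_pdf(error)
                if joint > best_joint:
                    best_pair, best_joint = [p1, p2], joint
    if best_pair is not None:
        return best_pair
    # Fallback: no consistent pair, so use the highest single-frame confidence.
    flat = [pc for frame in candidates for pc in frame]
    return [max(flat, key=lambda pc: pc[1])[0]]
```

A real implementation would compare full 6-DoF poses (rotation and translation) and would need a rule for deriving one target pose from the winning pair.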
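Figure 7 is a schematic diagram of an exemplary visual localization process. In the embodiment of Figure 7, each visual localization is performed jointly on the first image used in the current visual localization and the first image used in the previous one. Compared with the embodiment of Figure 6, the process requires no cooperation from the user, so it improves the user experience as well as the visual localization success rate and is more practical.
S701: capture a first image.
S702: extract first text from the first image and extract a first global feature vector of the first image.
S703: perform image retrieval according to the first text, the first global feature vector, the second text in a plurality of preset second images and the second global feature vectors of those second images, so as to select from the plurality of second images a third image that matches the first image, the plurality of second images having been collected while the visual localization map was built.
Illustratively, for S701 to S703, refer to the description of S501 to S503 above; details are not repeated here.
S704: determine the R first candidate poses with the highest single-frame confidence corresponding to the first image used in the current visual localization, R being a positive integer.
Illustratively, in S704 the multiple candidate poses corresponding to the first image used in the current visual localization may be determined in the manner of S604; the R first candidate poses with the highest single-frame confidence are then selected from them.
Illustratively, after these R first candidate poses are obtained, S705 to S707 may be performed to determine the target pose; in addition, the R first candidate poses may be saved for use in subsequent visual localization.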
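S705: add the SLAM pose to each of the R second candidate poses with the highest single-frame confidence corresponding to the first image used in the previous visual localization, to obtain R third candidate poses.
Illustratively, the previous visual localization stored the R second candidate poses with the highest single-frame confidence corresponding to the first image it used; these R second candidate poses can therefore be retrieved. Adding the SLAM pose to each of these R second candidate poses yields R third candidate poses.
S706: determine the probability corresponding to each of the R third candidate poses and the probability corresponding to each of the R first candidate poses.
Illustratively, suppose R equals 3. Denote the 3 second candidate poses by a1, a2 and a3, with single-frame confidences Pa1, Pa2 and Pa3. Denote the 3 first candidate poses by b1, b2 and b3, with single-frame confidences Pb1, Pb2 and Pb3. Denote the 3 third candidate poses by c1, c2 and c3, where c1 = a1 + SLAM pose, c2 = a2 + SLAM pose and c3 = a3 + SLAM pose.
Suppose the multi-frame localization success probability is Pms, the multi-frame localization failure probability is Pmf, the single-frame localization success probability is Pss, the single-frame localization failure probability is Psf, and the probability that the relative pose between a first candidate pose and a second candidate pose equals the SLAM pose is PΔrt. Then the probability of the third candidate pose c1 is Pa1*Psf*Pms, that of c2 is Pa2*Psf*Pms, and that of c3 is Pa3*Psf*Pms. The probability of the first candidate pose b1 is Pb1*Pa*Pss*PΔrt*Pms or Pb1*Pss*Pmf; that of the first candidate pose b2 is Pb2*Pa*Pss*PΔrt*Pms or Pb2*Pss*Pmf; and that of the first candidate pose b3 is Pb3*Pa*Pss*PΔrt*Pms or Pb3*Pss*Pmf.
S707: take the first candidate pose or third candidate pose with the highest probability as the target pose of the current visual localization.
Illustratively, the candidate pose with the highest probability (which may be a first candidate pose or a third candidate pose) is selected from the R first candidate poses and the R third candidate poses as the target pose of the current visual localization.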
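The scoring in S706 and S707 can be sketched under an explicit interpretation, which is an assumption and not stated in this application: for a first candidate pose b, the product Pb*Pa*Pss*PΔrt*Pms is used when b agrees (through the SLAM pose) with a previous-frame candidate whose confidence is Pa, and Pb*Pss*Pmf is used otherwise. The function name `fuse_candidates` and the `consistent` mapping are hypothetical.

```python
def fuse_candidates(first, third, consistent, p_ms, p_mf, p_ss, p_sf, p_drt):
    """Score first and third candidate poses and return (best_name, scores).

    first: {name: Pb} for the current frame's candidates.
    third: {name: Pa} for the previous frame's candidates moved by the SLAM pose.
    consistent: {first_name: third_name} for pairs assumed to agree with SLAM.
    """
    # Third candidates: Pa * Psf * Pms.
    scores = {name: pa * p_sf * p_ms for name, pa in third.items()}
    for name, pb in first.items():
        if name in consistent:
            # Assumed case: the current candidate is confirmed by a previous one.
            scores[name] = pb * third[consistent[name]] * p_ss * p_drt * p_ms
        else:
            # Assumed case: the current candidate stands on its own.
            scores[name] = pb * p_ss * p_mf
    best = max(scores, key=scores.get)
    return best, scores
```

The probabilities Pms, Pmf, Pss, Psf and PΔrt are passed in as parameters because the description does not say how they are obtained.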
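In this way, performing visual localization jointly on the first image used in the current visual localization and the first image used in the previous one reduces the error of single-frame visual localization and improves the success rate of visual localization, especially in scenes with illumination changes, seasonal changes, viewpoint and scale changes, repeated textures or weak textures.
This application also provides a method for building a navigation map, which registers the position of a place in the visual localization map and the 2D visual navigation map generated in the embodiment of Figure 5. In this way, when a user navigates with the AR map, the AR map can provide AR visual navigation, that is, perform S203 in the embodiment of Figure 2 above or S406 to S409 in the embodiment of Figure 4 above.
Figure 8 is a schematic diagram of an exemplary navigation map building process.
S801: collect place identification information of a place, the place identification information including place identification text and/or a place identification graphic.
Illustratively, the place identification information of the place may be collected; it may be place identification text or a place identification graphic.
For example, when the place is a supermarket, the place identification text may be the supermarket's name and the place identification graphic may be the supermarket's trademark.
For example, when the place is a zoo, the place identification text may be the zoo's name and the place identification graphic may be the zoo's trademark.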
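S802: perform multimodal retrieval in a plurality of preset first images to determine a second image containing the place identification information, the plurality of first images having been collected while the visual localization map was built.
Illustratively, the visual localization map database is generated while the visual localization map is built. It may include the images collected by the data collection devices (hereinafter first images), the 2D feature points contained in the first images, the 3D point cloud corresponding to the 2D feature points (which may include 3D coordinates), the identification text of the first images, and other data. For details, refer to the description in the embodiment of Figure 5 above; they are not repeated here.
Illustratively, when the place identification information is place identification text (such as place name text), the first images containing the place name text may be selected as second images according to the identification text of the first images in the visual localization map database.
Illustratively, when the place identification information is a place identification graphic (such as a place trademark graphic), the graphics contained in the first images in the visual localization map database may be extracted; then, according to those graphics, the first images containing the place identification graphic are selected from the first images as second images.
For example, if the place is a supermarket, the first images containing the supermarket's name or trademark may be selected from the first images as second images.
For example, if the place is a zoo, the first images containing the zoo's name or trademark may be selected from the first images as second images.
For example, if the place is a botanical garden, the first images containing the botanical garden's name or trademark may be selected from the first images as second images.
Illustratively, there may be one or more second images corresponding to a place.
S803: determine the 3D coordinates of the place in the visual localization map according to the place identification information in the second image.
Illustratively, the 2D feature points corresponding to the place identification information in the second image are determined from the 2D feature points in the visual localization map database; the 3D point cloud (including 3D coordinates) corresponding to the place identification information in the second image is then determined from the visual localization map database. The 3D coordinates in that 3D point cloud may then be determined as the 3D coordinates of the place in the visual localization map.
Illustratively, when there are multiple second images for the place, the 3D point clouds corresponding to the place identification information in the multiple second images are obtained, and the average of the 3D coordinates in those point clouds may be taken as the 3D coordinates of the place in the visual localization map.
S804: map the 3D coordinates of the place in the visual localization map into the 2D visual navigation map, to obtain the 2D coordinates of the place in the 2D visual navigation map.
In one possible approach, the 3D coordinates of the place in the visual localization map may be mapped directly onto the 2D plane, that is, converted from 3D to 2D, to obtain the 2D coordinates of the place in the 2D visual navigation map.
In another possible approach, when there are multiple second images, the 3D coordinates in the 3D point cloud corresponding to the place identification information in each second image may be mapped onto the 2D plane, yielding multiple sets of 2D coordinates. The average of these sets of 2D coordinates may then be taken as the 2D coordinates of the place in the 2D visual navigation map.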
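The averaging and the 3D-to-2D mapping in S803 and S804 can be sketched as follows. The projection used here simply drops the third (height) axis, which is an assumption; this application does not specify the projection. The function names `mean_point` and `to_2d` are hypothetical.

```python
def mean_point(points):
    """Component-wise average of a non-empty list of equal-length coordinate tuples."""
    if not points:
        raise ValueError("at least one point is required")
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def to_2d(point_3d):
    """Map a 3D map coordinate to the 2D plane by dropping the height axis (assumed)."""
    x, y, _height = point_3d
    return (x, y)
```

Because this projection is linear, averaging in 3D and then projecting gives the same 2D coordinate as projecting each point and then averaging, so the two approaches described above agree under this assumption.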
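Illustratively, for a large place (such as a large supermarket or a large zoo), a navigation map of the inside of the place may be built, so that users can later be guided quickly to the locations of the various kinds of objects inside it.
Figure 9 is a schematic diagram of an exemplary navigation map building process. The embodiment of Figure 9 shows the process of building the visual localization map inside a place.
S901: collect map reconstruction data inside the place and perform three-dimensional reconstruction on the map reconstruction data to update the visual localization map, the map reconstruction data including third images taken inside the place, and the place containing objects of multiple categories.
It should be understood that the visual localization map inside a place is built in a way similar to that described in the embodiment of Figure 5. The difference is that data collection takes place while walking inside the place, to collect images and other data there; three-dimensional reconstruction is then performed on the collected data to obtain the visual localization map inside the place.
Illustratively, the visual localization map inside the place may be added to the visual localization map of the embodiment of Figure 5 to update it, yielding the updated visual localization map. The visual localization map inside the place can be regarded as a sub-map of the visual localization map of the embodiment of Figure 5.
Illustratively, the user may perform visual localization at the entrance of the place and then start the SLAM system in the AR map. Illustratively, while the SLAM system is running, the user, holding a mobile phone or wearing AR glasses or a similar device (with the camera on), may walk through the place with the goal of covering the category identification information (which may include category identification text and/or category identification graphics) of all categories in the place, thereby collecting the third images inside the place.
For example, when the place is a large supermarket, the user may walk through the supermarket and cover the category name text or category trademark graphic of every product category.
For example, when the place is a large zoo, the user may walk through the zoo and cover the category name text or simple category line drawing of every animal category.
S902: extract the 2D feature points corresponding to the category identification text in the third images.
Illustratively, OCR may be performed on the third images to recognise the category identification text in them.
For example, if the place is a large supermarket, the category identification text recognised in the third images may be "personal care", "snacks", "fresh produce" and so on.
For example, if the place is a large zoo, the category identification text recognised in the third images may be "Seal Hall", "Penguin Hall", "Giraffe Feeding Area" and so on.
Illustratively, after the category identification text in the third images is recognised, the 2D feature points corresponding to that text in the third images may be extracted.
S903: determine the 3D point cloud, in the SLAM coordinate system, of the 2D feature points corresponding to the category identification text.
Illustratively, the SLAM system may compute the 3D point cloud, in the SLAM coordinate system, of the 2D feature points corresponding to the category identification text.
S904: map the 3D point cloud in the SLAM coordinate system to a 3D point cloud in the visual localization map coordinate system of the updated visual localization map.
Illustratively, the SLAM coordinate system and the visual localization map coordinate system are different coordinate systems; according to the transformation between them, the 3D point cloud in the SLAM coordinate system may be mapped to a 3D point cloud in the visual localization map coordinate system of the updated visual localization map.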
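The coordinate mapping in S904 can be sketched as a similarity transform, p_map = scale * R * p_slam + t. The form of the transformation is an assumption (this application only says that a transformation between the two coordinate systems is used), and the names `slam_to_map`, `rotation`, `translation` and `scale` are hypothetical.

```python
def slam_to_map(points, rotation, translation, scale=1.0):
    """Map 3D points from the SLAM frame to the map frame.

    rotation: 3x3 matrix as nested lists; translation: (tx, ty, tz).
    """
    mapped = []
    for point in points:
        # Rotate the point: each output component is a row of R dotted with the point.
        rotated = [sum(r * x for r, x in zip(row, point)) for row in rotation]
        mapped.append(tuple(scale * v + t for v, t in zip(rotated, translation)))
    return mapped
```

In practice the rotation, translation and scale would be estimated from the visual localization result obtained at the entrance of the place, where a pose is known in both coordinate systems.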
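The process of building the 2D visual navigation map for the objects of each category inside a place is described below.
Figure 10 is a schematic diagram of an exemplary navigation map building process.
S1001: obtain the category identification information of a category, the category identification information including category identification text and/or a category identification graphic.
S1002: perform multimodal retrieval in the third images to determine a fourth image containing the category identification information.
S1003: determine, according to the category identification information in the fourth image, the 3D coordinates of the objects of the category in the updated visual localization map.
S1004: update the 2D visual navigation map according to the updated visual localization map.
Illustratively, the 3D point cloud contained in the in-place visual localization map within the updated visual localization map may be mapped onto the 2D plane to obtain the 2D visual navigation map inside the place. This map is then added to the 2D visual navigation map of the embodiment of Figure 5, which covers multiple places, to obtain the updated 2D visual navigation map. The 2D visual navigation map inside the place can be regarded as a sub-map of the 2D visual navigation map of the embodiment of Figure 5.
S1005: map the 3D coordinates of the objects of the category in the updated visual localization map into the 2D visual navigation map, to obtain the 2D coordinates of the objects of the category in the 2D visual navigation map.
Illustratively, for S1001 to S1003 and S1005, refer to the description of S801 to S804 above; details are not repeated here.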
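In one example, Figure 11 shows a schematic block diagram of an apparatus 1100 according to an embodiment of this application. The apparatus 1100 may include a processor 1101 and a transceiver/transceiver pins 1102, and optionally a memory 1103.
The components of the apparatus 1100 are coupled together through a bus 1104, which includes a power bus, a control bus and a status signal bus in addition to a data bus. For clarity, however, all of these buses are referred to as the bus 1104 in the figure.
Optionally, the memory 1103 may be used to store the instructions of the foregoing method embodiments. The processor 1101 may be used to execute the instructions in the memory 1103, to control the receive pin to receive signals and to control the transmit pin to send signals.
The apparatus 1100 may be the electronic device of the foregoing method embodiments or a chip of the electronic device.
All relevant content of the steps in the foregoing method embodiments can be cited in the functional descriptions of the corresponding functional modules; details are not repeated here.
This embodiment also provides a computer-readable storage medium storing computer instructions which, when run on an electronic device, cause the electronic device to perform the relevant method steps above to implement the navigation method and/or the visual localization method and/or the navigation map building method of the foregoing embodiments.
This embodiment also provides a computer program product which, when run on a computer, causes the computer to perform the relevant steps above to implement the navigation method and/or the visual localization method and/or the navigation map building method of the foregoing embodiments.
In addition, an embodiment of this application provides an apparatus, which may specifically be a chip, a component or a module, and which may include a processor and a memory connected to each other. The memory stores computer-executable instructions; when the apparatus runs, the processor may execute the computer-executable instructions stored in the memory so that the chip performs the navigation method and/or the visual localization method and/or the navigation map building method of the foregoing method embodiments.
The electronic device, computer-readable storage medium, computer program product and chip provided in this embodiment are all used to perform the corresponding methods provided above; for their beneficial effects, refer to those of the corresponding methods, which are not repeated here.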
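From the description of the above implementations, those skilled in the art will understand that, for convenience and brevity of description, only the division into the functional modules above is used as an example. In practice, the functions may be assigned to different functional modules as required, that is, the internal structure of the apparatus may be divided into different functional modules to perform all or some of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses or units, and may be electrical, mechanical or of other forms.
Units described as separate components may or may not be physically separate, and components shown as units may be one physical unit or several, that is, they may be located in one place or distributed over several places. Some or all of the units may be selected as needed to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
Any content of the embodiments of this application, and any content within the same embodiment, may be freely combined. Any such combination falls within the scope of this application.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a readable storage medium. On this understanding, the technical solutions of the embodiments of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solutions, may be embodied as a software product stored in a storage medium and including instructions that cause a device (which may be a single-chip microcomputer, a chip or the like) or a processor to perform all or some of the steps of the methods of the embodiments of this application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The embodiments of this application have been described above with reference to the accompanying drawings, but this application is not limited to the specific implementations above, which are illustrative rather than restrictive. Inspired by this application, those of ordinary skill in the art may devise many further forms without departing from the purpose of this application and the scope protected by the claims, all of which fall within the protection of this application.
The steps of the methods or algorithms described in connection with the disclosure of the embodiments of this application may be implemented in hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable hard disk, a CD-ROM or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. The storage medium may also be part of the processor. The processor and the storage medium may reside in an ASIC.
Those skilled in the art will appreciate that, in one or more of the examples above, the functions described in the embodiments of this application may be implemented in hardware, software, firmware or any combination thereof. When implemented in software, the functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer-readable storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium accessible to a general-purpose or special-purpose computer.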
Claims (31)
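- A navigation method, characterised in that the method comprises: obtaining location search information entered by a user in an augmented reality (AR) map, the location search information comprising at least one of text, speech or an image; performing, based on the location search information, a multimodal search in preset multimodal information to determine a location search result matching the location search information; and performing AR visual navigation according to the location search result.
- The method according to claim 1, characterised in that the multimodal information comprises entity identification information of a plurality of entities, the entity identification information comprising entity identification text and/or an entity identification image; and the performing, based on the location search information, a multimodal search in preset multimodal information to determine a location search result matching the location search information comprises: performing feature extraction on the location search information to obtain a first feature corresponding to the location search information; performing feature extraction on the plurality of pieces of entity identification information to obtain second features respectively corresponding to the plurality of pieces of entity identification information; determining first feature distances between the second features respectively corresponding to the plurality of pieces of entity identification information and the first feature; and selecting, from the plurality of pieces of entity identification information, the entity identification information whose first feature distance is smaller than a first distance threshold as the location search result.
- The method according to claim 1 or 2, characterised in that the location search information comprises at least one of entity name text, entity name speech or an entity image; and the entity comprises a place and/or an object contained in the place.
- The method according to claim 2, characterised in that the determining first feature distances between the second features respectively corresponding to the plurality of pieces of entity identification information and the first feature comprises: for first entity identification information among the plurality of pieces of entity identification information, the first entity identification information comprising entity identification text and an entity identification image: performing a weighted calculation on the first feature distance between the first feature and the second feature corresponding to the entity identification text of the first entity identification information, and the first feature distance between the first feature and the second feature corresponding to the entity identification image of the first entity identification information; and taking the result of the weighted calculation as the first feature distance between the second feature corresponding to the first entity identification information and the first feature.
- The method according to any one of claims 1 to 4, characterised in that the performing AR visual navigation according to the location search result comprises: generating, based on a pre-generated 2D visual navigation map, a 2D navigation path between the user's current position and the target position corresponding to the location search result; performing visual localization to determine a target pose, the target pose being the current pose of the device; and performing AR visual navigation based on the 2D navigation path and the target pose.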
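- The method according to claim 5, characterised in that the performing visual localization to determine a target pose comprises: capturing a first image; extracting first text from the first image and extracting a first global feature vector of the first image; performing image retrieval based on the first text, the first global feature vector, second text in a plurality of preset second images and second global feature vectors of the plurality of second images, so as to select from the plurality of second images a third image matching the first image, the plurality of second images having been collected while a visual localization map was built, and the 2D visual navigation map having been generated from the visual localization map; and determining the target pose based on the first image and the third image, the target pose being the pose of the device when the first image was captured.
- The method according to claim 6, characterised in that the performing image retrieval according to the first text, the first global feature vector, the second text in the plurality of preset second images and the second global feature vectors of the plurality of second images, so as to select from the plurality of second images a third image matching the first image, comprises: selecting, according to the second text in the plurality of second images, a plurality of fourth images containing the first text from the plurality of second images; determining second feature distances between the second global feature vectors of the plurality of fourth images and the first global feature vector; and selecting, from the plurality of fourth images, the third image whose second feature distance is smaller than a second distance threshold.
- The method according to claim 6 or 7, characterised in that the extracting a first global feature vector of the first image comprises: determining the target region corresponding to the first text in the first image; increasing the weight corresponding to the target region in a network layer of a trained feature extraction network; and inputting the first image into the feature extraction network to obtain the first global feature vector output by the feature extraction network.
- The method according to any one of claims 6 to 8, characterised in that the capturing a first image comprises: capturing K first images while the device is rotating, each first image having M matching third images, K being an integer greater than 1 and M being a positive integer; and the determining the target pose based on the first image and the third image comprises: determining, based on the M third images matching each of the K first images, N candidate poses corresponding to each of the K first images and the single-frame confidence of each of the N candidate poses, N being a positive integer; and determining the target pose according to the N candidate poses corresponding to each of the K first images and the single-frame confidence of each of the N candidate poses.
- The method according to claim 9, characterised in that the determining the target pose according to the N candidate poses corresponding to each of the K first images and the single-frame confidence of each of the N candidate poses comprises: traversing the N candidate poses corresponding to each of the K first images and selecting one candidate pose from the N candidate poses of each of any two first images to form a pose combination, so as to obtain a plurality of pose combinations; for a first pose combination, determining the simultaneous localization and mapping (SLAM) pose between the two first images corresponding to the first pose combination, and the relative pose between the two candidate poses in the first pose combination; determining the pose error between the SLAM pose and the relative pose; if there is a candidate pose combination whose pose error is smaller than a preset error, determining the joint confidence of the candidate pose combination according to the single-frame confidences corresponding to the candidate pose combination; and determining the candidate pose combination with the highest joint confidence as the target pose.
- The method according to claim 10, characterised in that the method further comprises: if there is no candidate pose combination whose pose error is smaller than the preset error, determining the candidate pose with the highest single-frame confidence among the N candidate poses corresponding to each of the K first images as the target pose.
- The method according to any one of claims 6 to 8, characterised in that the determining the target pose based on the first image and the third image comprises: determining the R first candidate poses with the highest single-frame confidence corresponding to the first image used in the current visual localization, R being a positive integer; adding the SLAM pose to each of the R second candidate poses with the highest single-frame confidence corresponding to the first image used in the previous visual localization, to obtain R third candidate poses; determining the probability corresponding to each of the R third candidate poses and the probability corresponding to each of the R first candidate poses; and determining the first candidate pose or third candidate pose with the highest probability as the target pose.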
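- The method according to any one of claims 5 to 12, characterised in that the method further comprises: collecting place identification information of a place, the place identification information comprising place identification text and/or a place identification graphic; performing a multimodal search in the plurality of preset second images to determine a fifth image containing the place identification information, the plurality of second images having been collected while the visual localization map was built; determining the 3D coordinates of the place in the visual localization map according to the place identification information in the fifth image; and mapping the 3D coordinates of the place in the visual localization map into the 2D visual navigation map, to obtain the 2D coordinates of the place in the 2D visual navigation map.
- The method according to any one of claims 6 to 13, characterised in that the method further comprises: collecting map reconstruction data inside a place and performing three-dimensional reconstruction on the map reconstruction data to update the visual localization map, the map reconstruction data comprising sixth images taken inside the place, and the place containing objects of multiple categories; extracting the 2D feature points corresponding to category identification text in the sixth images, and determining the 3D point cloud, in the SLAM coordinate system, of the 2D feature points corresponding to the category identification text; and mapping the 3D point cloud in the SLAM coordinate system to a 3D point cloud in the visual localization map coordinate system of the updated visual localization map.
- The method according to claim 14, characterised in that the method further comprises: collecting category identification information of the category, the category identification information comprising category identification text and/or a category identification graphic; performing a multimodal search in the sixth images to determine a seventh image containing the category identification information; determining, according to the category identification information in the seventh image, the 3D coordinates of the objects of the category in the updated visual localization map; updating the 2D visual navigation map according to the updated visual localization map; and mapping the 3D coordinates, in the visual localization map, of the objects contained in the category into the updated 2D visual navigation map, to obtain the 2D coordinates of the objects of the category in the updated 2D visual navigation map.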
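- A visual localization method, characterised in that the method comprises: capturing a first image; extracting first text from the first image and extracting a first global feature vector of the first image; performing image retrieval based on the first text, the first global feature vector, second text in a plurality of preset second images and second global feature vectors of the plurality of second images, so as to select from the plurality of second images a third image matching the first image, the plurality of second images having been collected while a visual localization map was built; and determining a target pose based on the first image and the third image, the target pose being the pose of the device when the first image was captured.
- The method according to claim 16, characterised in that the performing image retrieval based on the first text, the first global feature vector, the second text in the plurality of preset second images and the second global feature vectors of the plurality of second images, so as to select from the plurality of second images a third image matching the first image, comprises: selecting, according to the second text in the plurality of second images, a plurality of fourth images containing the first text from the plurality of second images; determining second feature distances between the second global feature vectors of the plurality of fourth images and the first global feature vector; and selecting, from the plurality of fourth images, the third image whose second feature distance is smaller than a second distance threshold.
- The method according to claim 16 or 17, characterised in that the extracting a first global feature vector of the first image comprises: determining the target region corresponding to the first text in the first image; increasing the weight corresponding to the target region in a network layer of a trained feature extraction network; and inputting the first image into the feature extraction network to obtain the first global feature vector output by the feature extraction network.
- The method according to any one of claims 16 to 18, characterised in that the capturing a first image comprises: capturing K first images while the device is rotating, each first image having M matching third images, K being an integer greater than 1 and M being a positive integer; and the determining a target pose based on the first image and the third image comprises: determining, based on the M third images matching each of the K first images, N candidate poses corresponding to each of the K first images and the single-frame confidence of each of the N candidate poses, N being a positive integer; and determining the target pose according to the N candidate poses corresponding to each of the K first images and the single-frame confidence of each of the N candidate poses.
- The method according to claim 19, characterised in that the determining the target pose according to the N candidate poses corresponding to each of the K first images and the single-frame confidence of each of the N candidate poses comprises: traversing the N candidate poses corresponding to each of the K first images and selecting one candidate pose from the N candidate poses of each of any two first images to form a pose combination, so as to obtain a plurality of pose combinations; for a first pose combination, determining the simultaneous localization and mapping (SLAM) pose between the two first images corresponding to the first pose combination, and the relative pose between the two candidate poses in the first pose combination; determining the pose error between the SLAM pose and the relative pose; if there is a candidate pose combination whose pose error is smaller than a preset error, determining the joint confidence of the candidate pose combination according to the single-frame confidences corresponding to the candidate pose combination; and determining the candidate pose combination with the highest joint confidence as the target pose.
- The method according to claim 20, characterised in that the method further comprises: if there is no candidate pose combination whose pose error is smaller than the preset error, determining the candidate pose with the highest single-frame confidence among the N candidate poses corresponding to each of the K first images as the target pose.
- The method according to any one of claims 19 to 21, characterised in that the determining, based on the M third images matching each of the K first images, N candidate poses corresponding to each of the K first images comprises: for a target image among the K first images: clustering the M third images matching the target image by covisibility to obtain N groups of third images; and determining the N candidate poses corresponding to the target image based on the target image and the N groups of third images.
- The method according to any one of claims 19 to 21, characterised in that the determining, based on the M third images matching each of the K first images, N candidate poses corresponding to each of the K first images comprises: for a target image among the K first images: determining M candidate poses corresponding to the target image based on the target image and the M third images matching the target image; and clustering the M candidate poses corresponding to the target image to obtain the N candidate poses corresponding to the target image.
- The method according to any one of claims 16 to 18, characterised in that the determining a target pose based on the first image and the third image comprises: determining the R first candidate poses with the highest single-frame confidence corresponding to the first image used in the current visual localization, R being a positive integer; adding the SLAM pose to each of the R second candidate poses with the highest single-frame confidence corresponding to the first image used in the previous visual localization, to obtain R third candidate poses; determining the probability corresponding to each of the R third candidate poses and the probability corresponding to each of the R first candidate poses; and determining the first candidate pose or third candidate pose with the highest probability as the target pose.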
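- A navigation map building method, characterised in that the method comprises: collecting place identification information of a place, the place identification information comprising place identification text and/or a place identification graphic; performing multimodal retrieval in a plurality of preset first images to determine a second image containing the place identification information, the plurality of first images having been collected while a visual localization map was built; determining the 3D coordinates of the place in the visual localization map according to the place identification information in the second image; and mapping the 3D coordinates of the place in the visual localization map into a 2D visual navigation map, to obtain the 2D coordinates of the place in the 2D visual navigation map, the 2D visual navigation map being generated from the visual localization map.
- The method according to claim 25, characterised in that the method further comprises: collecting map reconstruction data inside a place and performing three-dimensional reconstruction on the map reconstruction data to update the visual localization map, the map reconstruction data comprising third images taken inside the place, and the place containing objects of multiple categories; extracting the 2D feature points corresponding to category identification text in the third images, and determining the 3D point cloud, in the SLAM coordinate system, of the 2D feature points corresponding to the category identification text; and mapping the 3D point cloud in the SLAM coordinate system to a 3D point cloud in the visual localization map coordinate system of the updated visual localization map.
- The method according to claim 26, characterised in that the method further comprises: obtaining category identification information of the category, the category identification information comprising category identification text and/or a category identification graphic; performing multimodal retrieval in the third images to determine a fourth image containing the category identification information; determining, according to the category identification information in the fourth image, the 3D coordinates of the objects of the category in the updated visual localization map; updating the 2D visual navigation map according to the updated visual localization map; and mapping the 3D coordinates of the objects of the category in the updated visual localization map into the 2D visual navigation map, to obtain the 2D coordinates of the objects of the category in the 2D visual navigation map.
- An electronic device, characterised by comprising a memory and a processor, the memory being coupled to the processor; the memory stores program instructions which, when executed by the processor, cause the electronic device to perform the method according to any one of claims 1 to 27.
- A chip, characterised by comprising one or more interface circuits and one or more processors; the interface circuit is configured to receive signals from a memory of an electronic device and send the signals to the processor, the signals comprising computer instructions stored in the memory; when the processor executes the computer instructions, the electronic device is caused to perform the method according to any one of claims 1 to 27.
- A computer-readable storage medium, characterised in that the computer-readable storage medium stores a computer program which, when run on a computer or a processor, causes the computer or the processor to perform the method according to any one of claims 1 to 27.
- A computer program product, characterised in that the computer program product contains a software program which, when executed by a computer or a processor, causes the steps of the method according to any one of claims 1 to 27 to be performed.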
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210709970.3 | 2022-06-22 | ||
CN202210709970.3A CN117333638A (zh) | 2022-06-22 | 2022-06-22 | 导航、视觉定位以及导航地图构建方法和电子设备 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023246537A1 true WO2023246537A1 (zh) | 2023-12-28 |
Family
ID=89281653
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/099610 WO2023246537A1 (zh) | 2022-06-22 | 2023-06-12 | 导航、视觉定位以及导航地图构建方法和电子设备 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117333638A (zh) |
WO (1) | WO2023246537A1 (zh) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118258406A (zh) * | 2024-05-29 | 2024-06-28 | 浙江大学湖州研究院 | 一种基于视觉语言模型的自动导引车导航方法及装置 |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050094879A1 (en) * | 2003-10-31 | 2005-05-05 | Michael Harville | Method for visual-based recognition of an object |
CN107301865A (zh) * | 2017-06-22 | 2017-10-27 | 海信集团有限公司 | 一种用于语音输入中确定交互文本的方法和装置 |
CN109840287A (zh) * | 2019-01-31 | 2019-06-04 | 中科人工智能创新技术研究院(青岛)有限公司 | 一种基于神经网络的跨模态信息检索方法和装置 |
CN110017841A (zh) * | 2019-05-13 | 2019-07-16 | 大有智能科技(嘉兴)有限公司 | 视觉定位方法及其导航方法 |
CN112179330A (zh) * | 2020-09-14 | 2021-01-05 | 浙江大华技术股份有限公司 | 移动设备的位姿确定方法及装置 |
CN112270710A (zh) * | 2020-11-16 | 2021-01-26 | Oppo广东移动通信有限公司 | 位姿确定方法、位姿确定装置、存储介质与电子设备 |
CN113532442A (zh) * | 2021-08-26 | 2021-10-22 | 杭州北斗时空研究院 | 一种室内ar行人导航方法 |
CN113656546A (zh) * | 2021-08-17 | 2021-11-16 | 百度在线网络技术(北京)有限公司 | 多模态搜索方法、装置、设备、存储介质以及程序产品 |
-
2022
- 2022-06-22 CN CN202210709970.3A patent/CN117333638A/zh active Pending
-
2023
- 2023-06-12 WO PCT/CN2023/099610 patent/WO2023246537A1/zh unknown
Also Published As
Publication number | Publication date |
---|---|
CN117333638A (zh) | 2024-01-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23826186 Country of ref document: EP Kind code of ref document: A1 |