
WO2024012333A1 - Pose estimation method and related model training method, apparatus, electronic device, computer-readable medium and computer program product - Google Patents


Info

Publication number
WO2024012333A1
Authority
WO
WIPO (PCT)
Prior art keywords
pose
sample
target
image
pixel point
Prior art date
Application number
PCT/CN2023/105934
Other languages
English (en)
French (fr)
Inventor
周晓巍
林浩通
彭思达
Original Assignee
上海商汤智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2024012333A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G06T7/70 - Determining position or orientation of objects or cameras
    • G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74 - Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G06T7/90 - Determination of colour characteristics
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10004 - Still image; Photographic image
    • G06T2207/10012 - Stereo images
    • G06T2207/10024 - Color image
    • G06T2207/10028 - Range image; Depth image; 3D point clouds
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G06T2207/20084 - Artificial neural networks [ANN]

Definitions

  • the present disclosure relates to but is not limited to the field of artificial intelligence technology, and in particular, to a pose estimation method and related model training methods, devices, electronic equipment, computer-readable media and computer program products.
  • cameras can be used to capture images of objects that need to be positioned, and then network models are used to process the captured images to obtain the pose of the object.
  • Embodiments of the present disclosure provide at least a pose estimation method and related model training methods, devices, electronic devices, computer-readable media, and computer program products.
  • Embodiments of the present disclosure provide a training method for a pose estimation model, which includes: obtaining a sample image containing an object to be located, the sample image including a sample color image and a sample depth image corresponding to the sample color image; using the pose estimation model to process the sample color image to obtain the sample initial pose of the object to be located; optimizing the sample initial pose based on the depth information of the object to be located in the sample depth image to obtain the optimized pose of the object to be located; and adjusting the network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose.
  • the pose estimation model is used to process the sample color image to obtain the sample initial pose of the object to be located, and then the sample depth image is used to optimize the sample initial pose.
  • the pose estimation process of the object to be located not only utilizes the color, texture, contour and other features in the sample color image, but also utilizes the depth features in the sample depth image, making the optimized pose of the object to be located more accurate.
  • the difference between the optimized pose and the sample initial pose is used to adjust the network parameters in the pose estimation model. There is no need to label the sample color image, which reduces the labeling workload and improves the training efficiency of the pose estimation model.
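  • As an illustration only, the following Python/PyTorch sketch shows one way the training loop described above could look; the function names pose_model and refine_with_depth, the pose representation, and the mean-squared-error loss are assumptions introduced for this example, not details fixed by the present disclosure.

```python
import torch

def train_step(pose_model, refine_with_depth, color_img, depth_img, optimizer):
    """One self-supervised training step (illustrative sketch).

    pose_model: network mapping a sample color image to a sample initial pose tensor.
    refine_with_depth: non-learned routine that optimizes a pose against the
        sample depth image (e.g. the depth-based refinement described later).
    """
    init_pose = pose_model(color_img)                        # sample initial pose
    with torch.no_grad():
        optimized_pose = refine_with_depth(init_pose.detach(), depth_img)
    # The optimized pose plays the role of the label, so no manual annotation is needed.
    loss = torch.nn.functional.mse_loss(init_pose, optimized_pose)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```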
  • Embodiments of the present disclosure also provide a pose estimation method, including: acquiring a target image containing an object to be located, where the target image includes a target color image and a target depth image corresponding to the target color image; using a pose estimation model to process the target color image to obtain the target initial pose of the object to be located; and optimizing the target initial pose based on the depth information of the object to be located in the target depth image to obtain the target pose of the object to be located; wherein the pose estimation model is trained using the above training method for the pose estimation model.
  • Embodiments of the present disclosure also provide a training device for a pose estimation model, including: a sample image acquisition part configured to acquire a sample image containing an object to be located, where the sample image includes a sample color image and a sample depth image corresponding to the sample color image; a sample pose estimation part configured to use the pose estimation model to process the sample color image to obtain the sample initial pose of the object to be located; a sample pose optimization part configured to optimize the sample initial pose based on the depth information of the object to be located in the sample depth image to obtain the optimized pose of the object to be located; and a parameter adjustment part configured to adjust the network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose.
  • Embodiments of the present disclosure also provide a pose estimation device, including: a target image acquisition part configured to acquire a target image containing an object to be located, where the target image includes a target color image and a target depth image corresponding to the target color image;
  • the pose estimation part is configured to use the pose estimation model to process the target color image to obtain the target initial pose of the object to be located;
  • the target pose optimization part is configured to optimize the target initial pose based on the depth information of the object to be located in the target depth image to obtain the target pose of the object to be located; wherein the pose estimation model is trained using the training device for the above pose estimation model.
  • Embodiments of the present disclosure also provide an electronic device, including a memory and a processor coupled to each other.
  • the processor is configured to execute program instructions stored in the memory to implement the above training method for the pose estimation model, or to implement the above pose estimation method.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which program instructions are stored.
  • When the program instructions are executed by a processor, the above training method for the pose estimation model is implemented, or the above pose estimation method is implemented.
  • Embodiments of the present disclosure also provide a computer program product.
  • the computer program product includes a non-transitory computer-readable storage medium storing a computer program. When the computer program is read and executed by a computer, some or all of the steps of the above methods are implemented.
  • Figure 1 is a schematic flowchart of a training method for a pose estimation model provided by an embodiment of the present disclosure
  • FIG. 2 is a sub-flow schematic diagram of step S13 in the flow diagram shown in Figure 1;
  • Figure 3 is a schematic flowchart of a training method for a pose estimation model provided by an embodiment of the present disclosure
  • Figure 4 is a schematic flowchart of a pose estimation method provided by an embodiment of the present disclosure
  • Figure 5 is a schematic structural diagram of a training device for a pose estimation model provided by an embodiment of the present disclosure
  • Figure 6 is a schematic structural diagram of a pose estimation device provided by an embodiment of the present disclosure.
  • Figure 7 is a block diagram of an electronic device provided by an embodiment of the present disclosure.
  • Figure 8 is a block diagram of a computer-readable storage medium provided by an embodiment of the present disclosure.
  • A and/or B can mean three situations: A exists alone, A and B exist simultaneously, or B exists alone.
  • the character "/" in this article generally indicates that the related objects are in an "or" relationship.
  • "many" in this article means two or more than two.
  • the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, including at least one of A, B, and C can mean including any one or more elements selected from the set composed of A, B, and C.
  • Embodiments of the present disclosure provide a training method for a pose estimation model.
  • the execution subject of the training method may be a training device for the pose estimation model.
  • the training device for the pose estimation model may be any type of terminal device, server, or other processing device capable of executing the method of the embodiments of the present disclosure, wherein the terminal device can be a visual positioning device, user equipment (User Equipment, UE), mobile device, user terminal, terminal, cellular phone, cordless phone, personal digital assistant (Personal Digital Assistant, PDA), handheld device, computing device, vehicle-mounted device, wearable device, etc.
  • the training method of the pose estimation model can be implemented by the processor calling computer readable instructions stored in the memory.
  • Figure 1 is a schematic flowchart of a training method for a pose estimation model provided by an embodiment of the present disclosure.
  • the method may include steps S11 to S14:
  • Step S11 Obtain a sample image containing the object to be located, where the sample image includes a sample color image and a sample depth image corresponding to the sample color image.
  • the sample image can be a real image or a synthetic image.
  • sample images may include part real images and part synthetic images.
  • the way to obtain the sample image containing the object to be located may be to photograph the object to be located with the execution device that executes the training method of the pose estimation model provided by the embodiment of the present disclosure, or another device may photograph the object to be located and then transmit the captured image to the execution device through a communication connection.
  • a published image data set for pose estimation can be used as a sample image.
  • the pixel value of each pixel in the sample depth image is used to represent the depth value of the corresponding pixel in the sample color image.
  • the depth value may be the distance between the three-dimensional point on the photographed object corresponding to the pixel point and the photographing device.
  • Step S12 Use the pose estimation model to process the sample color image to obtain the sample initial pose of the object to be located.
  • the pose estimation model may be a pre-trained model or a non-pre-trained model.
  • the pose estimation model can be a residual network (ResNet) or a network with any other structure. The pose estimation model can directly output the sample initial pose of the object to be located, or the pose estimation model can output an intermediate result, and other models or networks can then further process the intermediate result to obtain the sample initial pose of the object to be located.
  • the sample initial pose may be a six-degree-of-freedom pose, that is, the sample initial pose includes the position and orientation of the object to be located in the camera coordinate system.
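  • As an illustration only, a ResNet-based pose regressor of the kind mentioned above might be sketched as follows; the backbone depth (ResNet-18), the single linear head, and the translation-plus-quaternion output are assumptions made for this example and are not specified by the present disclosure.

```python
import torch.nn as nn
import torchvision

class PoseEstimator(nn.Module):
    """Illustrative ResNet-based regressor for a six-degree-of-freedom pose."""

    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # keep the 512-d image feature
        self.backbone = backbone
        self.head = nn.Linear(512, 7)        # [tx, ty, tz, qw, qx, qy, qz]

    def forward(self, x):
        out = self.head(self.backbone(x))
        t, q = out[:, :3], out[:, 3:]
        q = nn.functional.normalize(q, dim=1)   # unit quaternion for orientation
        return t, q                             # position and orientation in the camera frame
```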
  • Step S13 Based on the depth information of the object to be located in the sample depth image, the initial pose of the sample is optimized to obtain the optimized pose of the object to be located.
  • the pixel value of each pixel in the sample depth image is used to represent the depth value of the corresponding pixel in the sample color image, where the depth value can be the difference between the three-dimensional point on the photographed object corresponding to the pixel and the shooting device. distance between.
  • Since the sample color image is a two-dimensional image, it can only reflect the color, texture, and other characteristics of the object to be located, but cannot well reflect the distance between the object to be located and the shooting device, so the sample initial pose obtained from the sample color image may not be accurate.
  • Therefore, by combining the depth information of the object to be located in the sample depth image, the sample initial pose is optimized so that the optimized pose reflects more information related to the object to be located and is more accurate.
  • Step S14 Adjust the network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose.
  • the loss can be determined based on the difference between the optimized pose and the initial pose of the sample, and then the loss can be used to adjust the network parameters in the pose estimation model.
  • the pose estimation model is used to process the sample color image to obtain the sample initial pose of the object to be located, and then the sample depth image is used to optimize the sample initial pose.
  • the pose estimation process of the object to be located not only uses the color, texture, contour and other features in the sample color image, but also uses the depth features in the sample depth image, making the optimized pose of the object to be located more accurate.
  • the network parameters in the pose estimation model can be adjusted without annotating the pose of the sample color image: the difference between the optimized pose and the sample initial pose is used in place of an annotated pose to adjust the network parameters, thereby reducing the annotation workload and improving the training efficiency of the pose estimation model.
  • the above step S12 may include the following steps S121 to S123:
  • Step S121 Use the pose estimation model to perform target detection on the sample color image to obtain the position of the object to be located.
  • the pose estimation model includes a target detection sub-network, and the target detection sub-network is configured to perform target detection on the sample color image and obtain the position of the object to be located in the sample color image.
  • the target detection sub-network and the pose estimation model can also be independent of each other, that is, first use the target detection sub-network to perform target detection on the sample color image to obtain the position of the object to be located, and then use the pose The estimation model processes the sample color image based on the detection results of the target detection subnetwork.
  • Step S122 Based on the position of the object to be located in the sample color image, the sample color image is cropped to obtain a partial image containing the object to be located.
  • the cropping method may be to expand the area where the object to be located is situated on the sample color image by a preset scale, and then use the cropped-out region containing the object to be located as the partial image.
  • Step S123 Process the local image to obtain the sample initial pose of the object to be located.
  • the target detection sub-network is first used to perform target detection on the sample color image to obtain the position of the object to be located; then, the sample color image is cropped to obtain a partial image containing the object to be located; finally, the partial image is processed to obtain the sample initial pose of the object to be located. In this way, the background portion in the processed partial image can be reduced, thereby reducing the interference of the background in the synthetic image on the processing results and improving the estimation accuracy of the sample initial pose.
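  • A minimal sketch of the cropping step, assuming the detection sub-network returns an axis-aligned box and that the "preset scale" means enlarging that box by a fixed ratio; the ratio value used here is illustrative.

```python
import numpy as np

def crop_around_object(color_img, box, expand_ratio=0.2):
    """Crop a partial image around the detected object (illustrative).

    color_img: H x W x 3 array; box: (x_min, y_min, x_max, y_max) from detection.
    """
    h, w = color_img.shape[:2]
    x0, y0, x1, y1 = box
    dw, dh = (x1 - x0) * expand_ratio, (y1 - y0) * expand_ratio   # expand by the preset scale
    x0, y0 = int(max(0, x0 - dw)), int(max(0, y0 - dh))
    x1, y1 = int(min(w, x1 + dw)), int(min(h, y1 + dh))
    return color_img[y0:y1, x0:x1]
```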
  • step S12 may also include the following steps S124 to S125:
  • Step S124 Use the pose estimation model to determine the projection position of at least one three-dimensional key point of the object to be located on the sample color image.
  • the sample color image can be cropped to obtain a partial image containing the object to be located, and determining the projection position of at least one three-dimensional key point of the object to be located on the sample color image can be done by determining the projection position of each three-dimensional key point on the partial image.
  • at least one three-dimensional key point of the object to be located may be extracted on a preset three-dimensional model corresponding to the object to be located. For example, at least one three-dimensional key point may be a set of three-dimensional points obtained from a preset three-dimensional model through a farthest point sampling algorithm.
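  • The farthest point sampling mentioned above can be sketched as follows; starting from a random vertex is an assumption made for this example.

```python
import numpy as np

def farthest_point_sampling(vertices, k):
    """Select k well-spread three-dimensional key points from an N x 3 vertex array."""
    n = vertices.shape[0]
    selected = [np.random.randint(n)]                 # arbitrary starting vertex
    dist = np.full(n, np.inf)
    for _ in range(k - 1):
        dist = np.minimum(dist, np.linalg.norm(vertices - vertices[selected[-1]], axis=1))
        selected.append(int(np.argmax(dist)))         # farthest point from the current set
    return vertices[selected]
```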
  • Step S125 Determine the sample initial pose of the object to be located based on the projection position of each three-dimensional key point on the sample color image and the internal parameters of the target camera.
  • the initial pose of the sample of the object to be located can be determined by solving the PnP (Perspective-n-Point) problem.
  • the intrinsic parameters of the target camera may include parameters such as focal length.
  • the method of obtaining the sample initial pose of the object to be located by solving the PnP problem will not be described in detail here.
  • the target camera may be a camera that collects sample images, the same as the above-mentioned shooting device.
  • the projection position of the three-dimensional key points of the object to be located on the sample color image can be determined through the pose estimation model, and the sample initial pose of the object to be located is then obtained based on the determined projection positions of the three-dimensional key points and the internal parameters of the target camera.
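  • Once the projection positions of the three-dimensional key points are known, the PnP problem can be solved with an off-the-shelf solver; the sketch below uses OpenCV's solvePnP as one possible choice (the present disclosure does not name a specific solver).

```python
import cv2
import numpy as np

def pose_from_keypoints(keypoints_3d, projections_2d, camera_matrix):
    """Recover the object pose from 3D key points and their 2D projection positions.

    keypoints_3d: N x 3 points in the object (model) frame.
    projections_2d: N x 2 pixel positions predicted on the sample color image.
    camera_matrix: 3 x 3 intrinsic matrix of the target camera.
    """
    ok, rvec, tvec = cv2.solvePnP(
        keypoints_3d.astype(np.float32),
        projections_2d.astype(np.float32),
        camera_matrix.astype(np.float32),
        None,                               # no distortion coefficients
        flags=cv2.SOLVEPNP_EPNP,
    )
    R, _ = cv2.Rodrigues(rvec)              # rotation vector -> 3 x 3 rotation matrix
    return R, tvec                          # object pose in the camera coordinate system
```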
  • step S124 may include steps S1241 to S1243:
  • Step S1241 Use the pose estimation model to predict the direction vectors of each object pixel to each projection position.
  • the object pixels are the pixels belonging to the object to be located in the sample color image.
  • the pose estimation model can perform target detection on the sample color image and obtain the position of the object to be located in the sample color image.
  • the semantic labels of pixels belonging to the object to be located are set to a first preset value (for example, the first preset value may be 1), and the semantic labels of pixels that do not belong to the object to be located are set to a second preset value (for example, the second preset value may be 0).
  • the direction vector may be a two-dimensional vector, where one dimension is the component along the x-axis direction of the sample color image (for example, the horizontal axis direction of the sample color image), and the other dimension is the component along the y-axis direction of the sample color image (for example, the vertical axis direction of the sample color image).
  • the direction vector from the object pixel point to the projection position can refer to formula (1): v_k(p) = x_k - p, where v_k(p) represents the direction vector from the object pixel point p to the k-th projection position, x_k represents the two-dimensional coordinates of the k-th projection position, and p represents the two-dimensional coordinates of the position of the pixel point p.
  • a preset number of direction vectors is determined from the at least one direction vector corresponding to the projection position, and a candidate projection position is generated for each selected direction vector. For example, the position of each object pixel point is summed with the direction vector corresponding to that object pixel point to obtain the candidate projection position corresponding to that object pixel point. For example, if there are 10 object pixels belonging to the object to be located in the sample color image and the preset number is 5, then 5 direction vectors can be selected from the 10 direction vectors; each selected direction vector has a corresponding object pixel point, and each such object pixel point is added to its corresponding direction vector to obtain a candidate projection position, so that 5 candidate projection positions are obtained.
  • the input of the pose estimation model is at least one sample color image
  • the output result is the semantic label corresponding to each pixel and the direction vector corresponding to each pixel.
  • the semantic label is used to indicate whether the pixel belongs to the object to be located.
  • the several input sample color images can include a variety of objects to be located, and the semantic label output for each pixel may indicate which object to be located the pixel belongs to.
  • various objects to be located may include cups, tables, stools, etc. That is, the pose estimation model obtained according to the pose estimation model training method provided by the embodiments of the present disclosure can estimate the poses of multiple objects to be located at the same time, and obtain the target pose of each object to be located.
  • Step S1242 Based on the direction vector and the corresponding object pixel point, determine the corresponding candidate projection position; the determination can refer to formula (2): x_{k,i} = p_i + v_k(p_i), where x_{k,i} represents the i-th candidate projection position for the k-th projection position, p_i represents the position of the selected object pixel point, and v_k(p_i) represents the direction vector corresponding to that object pixel point.
  • Step S1243 Determine the score of each candidate projection position based on the positional relationship between the candidate projection positions, and use the candidate projection position whose score meets the preset requirements as the projection position.
  • the above method of determining the score of each candidate projection position based on the positional relationship between the candidate projection positions may be: for each candidate projection position, determine the target distance between the candidate projection position and other candidate projection positions.
  • the number of target distances is used as the score of the current candidate projection position.
  • the target distance is a distance less than or equal to the preset distance.
  • the distance between the current candidate projection position and another candidate projection position is obtained by taking the difference between the current candidate projection position and that other candidate projection position.
  • the size of the preset distance can be adjusted during the training process of the pose estimation model to determine the final preset distance.
  • the way to calculate the score of each candidate projection position can refer to formula (3): w_{k,i} = Σ_{j≠i} I(||x_{k,i} - x_{k,j}|| ≤ θ), where w_{k,i} represents the score of the i-th candidate projection position, I is an indicator function which is 1 if the condition is met and 0 if the condition is not met, and θ is the preset distance; for example, θ can take the value of 1.
  • the method of using the candidate projection position whose score meets the preset requirement as the projection position may be: using the candidate projection position corresponding to the maximum score as the projection position.
  • the method of determining the projection positions corresponding to other three-dimensional key points can refer to the above process.
  • In this way, the final projection position is determined by scoring the candidate projection positions, making the determined projection position more accurate.
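  • A hedged sketch of the candidate generation and scoring in steps S1241 to S1243, following formulas (2) and (3) as written above; the random sampling of the preset number of direction vectors is an assumption, since the text does not specify how the vectors are selected.

```python
import numpy as np

def vote_projection(pixel_coords, direction_vectors, num_candidates=5, theta=1.0):
    """Select one key point projection from per-pixel direction vectors (illustrative).

    pixel_coords: M x 2 positions of object pixels; direction_vectors: M x 2 predicted
    vectors toward the projection; theta: preset distance used for scoring.
    """
    m = pixel_coords.shape[0]
    idx = np.random.choice(m, size=min(num_candidates, m), replace=False)
    candidates = pixel_coords[idx] + direction_vectors[idx]          # formula (2)
    # Formula (3): score = number of other candidates within the preset distance.
    dists = np.linalg.norm(candidates[:, None, :] - candidates[None, :, :], axis=-1)
    scores = (dists <= theta).sum(axis=1) - 1                        # exclude the candidate itself
    return candidates[int(np.argmax(scores))]                        # maximum-score candidate
```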
  • the training method of the pose estimation model provided by the embodiments of the present disclosure may also include a pre-training step for the pose estimation model.
  • the pre-training step may be: obtaining several sample images, which may be the same as or different from the sample images obtained in step S11; obtaining the sample semantic label and sample projection position of each pixel on the sample images; determining a first loss between the semantic labels output by the pose estimation model and the sample semantic labels; determining the projection positions based on the direction vectors of each object pixel output by the pose estimation model, and determining a second loss between the determined projection positions and the sample projection positions; and adjusting the network parameters in the pose estimation model based on the first loss and the second loss.
  • the initial learning rate can be set to 1e-3, and the learning rate is halved after every first predetermined number of iterations.
  • the learning rate can be adjusted to 5e-4, and after every second predetermined number of iterations, the learning rate is halved.
  • the first predetermined number of iterations is twice the second predetermined number of iterations.
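  • The pre-training losses and the halving learning-rate schedule can be sketched as follows; only the initial rate of 1e-3 (later 5e-4) and the halving behavior come from the text above, while the cross-entropy and smooth-L1 terms and the step size are illustrative assumptions.

```python
import torch
import torch.nn as nn

def pretrain_loss(pred_logits, pred_vectors, gt_labels, gt_vectors):
    """First loss on semantic labels plus second loss on direction vectors (illustrative)."""
    seg_loss = nn.functional.cross_entropy(pred_logits, gt_labels)
    mask = (gt_labels > 0).unsqueeze(1).float()              # only pixels of the object
    vec_err = nn.functional.smooth_l1_loss(pred_vectors, gt_vectors, reduction="none")
    vec_loss = (vec_err * mask).sum() / mask.sum().clamp(min=1)
    return seg_loss + vec_loss

def make_optimizer(model, step_iters=20000):
    """Initial learning rate 1e-3, halved every step_iters iterations (step value assumed)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=step_iters, gamma=0.5)
    return opt, sched
```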
  • At least one candidate projection position is determined based on the direction vector of each object pixel point with respect to the projection position, and a candidate projection position that meets the requirements is then selected from the at least one candidate projection position as the final projection position, making the obtained projection position more accurate.
  • FIG 2 is a schematic sub-flow diagram of step S13 in the training method of the pose estimation model shown in Figure 1.
  • the above step S13 may include steps S131 to S133:
  • Step S131 Determine a rendering depth map of the object to be positioned based on the initial pose of the sample and the preset three-dimensional model corresponding to the object to be positioned.
  • the preset three-dimensional model may be drawn using drawing software, or may be obtained by using a modeling network to perform three-dimensional modeling of the object to be positioned using at least one image containing the object to be positioned.
  • the initial pose of the sample can be considered as the pose of the preset three-dimensional model corresponding to the object to be located in the camera coordinate system.
  • the above-mentioned method of determining the rendered depth map of the object to be located based on the sample initial pose and the preset three-dimensional model corresponding to the object to be located may be: based on the sample initial pose, projecting the preset three-dimensional model onto the camera plane to obtain the rendered depth map.
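  • A simplified sketch of producing the rendered depth map by projecting the preset three-dimensional model under the sample initial pose; a real implementation would rasterize the mesh triangles, whereas this illustration only projects the model vertices and keeps the nearest depth per pixel.

```python
import numpy as np

def render_depth(vertices, R, t, camera_matrix, image_size):
    """Approximate rendered depth map from model vertices (illustrative)."""
    h, w = image_size
    cam_pts = vertices @ R.T + t                 # model frame -> camera frame (sample initial pose)
    z = cam_pts[:, 2]
    uvw = cam_pts @ camera_matrix.T              # project onto the camera plane
    uv = uvw[:, :2] / uvw[:, 2:3]
    depth = np.full((h, w), np.inf)
    for (u, v), d in zip(uv.astype(int), z):
        if 0 <= u < w and 0 <= v < h and d > 0:
            depth[v, u] = min(depth[v, u], d)    # keep the nearest surface point
    depth[np.isinf(depth)] = 0.0
    return depth
```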
  • Step S132 Use the difference between the rendered depth map and the sample depth image to determine the optimization item.
  • the training method of the pose estimation model may also include the following steps: determining a normal map of the object to be located based on the initial pose of the sample and the preset three-dimensional model.
  • the preset three-dimensional model may be composed of at least one plane (for example, a triangular mesh surface); each plane corresponds to the pixel value of a pixel in the normal map, and that pixel value may be used to represent the normal direction of the plane.
  • the way to determine the optimization item can be:
  • the first point cloud includes a first three-dimensional point corresponding to at least one object pixel point
  • the second point cloud includes a second three-dimensional point corresponding to each object pixel point.
  • the object pixels are the pixels belonging to the object to be located in the sample color image.
  • the sample initial pose of the object to be located is used to back-project the rendered depth map to obtain the first point cloud, and the sample initial pose of the object to be located is used to back-project the sample depth image to obtain the second point cloud.
  • the deviation representation value corresponding to the object pixel point is determined.
  • the deviation representation value can be the residual.
  • the deviation representation value corresponding to the object pixel point is the product of the target pose difference corresponding to the object pixel point and the corresponding normal direction of the object pixel point in the normal map.
  • the target pose difference is the pose difference between the first three-dimensional point corresponding to the object pixel point and the corresponding second three-dimensional point.
  • for an object pixel point p, the deviation representation value can be written as N_r(p) · (π^{-1}(D_r(p)) - π^{-1}(D(p))), where π^{-1} is the back-projection function, D_r(p) represents the depth value of the object pixel point p in the rendered depth map, D(p) represents the depth value of the object pixel point p in the sample depth map, N_r(p) represents the normal direction corresponding to the object pixel point p in the normal map, and π^{-1}(D_r(p)) - π^{-1}(D(p)) represents the pose difference between the first three-dimensional point and the corresponding second three-dimensional point.
  • the deviation representation value of the object pixel point may be the minimum distance from the second three-dimensional point to the plane of the first three-dimensional point, and the plane is defined by the first three-dimensional point and its normal.
  • the optimization terms are determined based on the deviation representation values corresponding to each object pixel. For example, the sum of the deviation characterization values, the average value of the deviation characterization values, or the maximum value among the deviation characterization values is used as the optimization term.
  • the gradient descent method is used to minimize these residual terms to obtain the desired optimized pose.
  • a set of hypothetical sample initial poses can be generated by perturbing the sample initial pose. Then, these poses are optimized to obtain more accurate optimized poses.
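  • A hedged sketch of the depth-based optimization described above: back-project the rendered depth map and the sample depth image, form the deviation representation values weighted by the normal map, and minimize the optimization term over a pose update by gradient descent. The axis-angle pose update, the small-angle approximation, and the use of PyTorch autograd are assumptions made for this example.

```python
import torch

def backproject(depth, K_inv):
    """pi^{-1}: lift an H x W depth map to camera-frame 3D points (H x W x 3)."""
    h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()     # homogeneous pixels
    return (pix @ K_inv.T) * depth.unsqueeze(-1)

def refine_pose(rendered_depth, sample_depth, normals, mask, K_inv, steps=50, lr=1e-2):
    """Minimize the mean |N_r(p) . (pi^{-1}(D_r(p)) - pi^{-1}(D(p)))| over a pose update."""
    p_ren = backproject(rendered_depth, K_inv)[mask]      # first point cloud (object pixels)
    p_obs = backproject(sample_depth, K_inv)[mask]        # second point cloud
    n = normals[mask]                                     # normal directions from the normal map
    delta = torch.zeros(6, requires_grad=True)            # [axis-angle rotation | translation]
    opt = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        w_rot, t = delta[:3], delta[3:]
        # Small-angle update applied to the rendered points: p + w x p + t.
        p_cur = p_ren + torch.cross(w_rot.expand_as(p_ren), p_ren, dim=-1) + t
        residual = ((p_cur - p_obs) * n).sum(dim=-1)      # deviation representation values
        loss = residual.abs().mean()                      # optimization term
        loss.backward()
        opt.step()
    return delta.detach()       # pose correction to apply to the sample initial pose
```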
  • Step S133 Adjust the initial pose of the sample so that the optimization item meets the preset requirements, and use the adjusted initial pose of the sample as the optimized pose.
  • the preset requirement may be to minimize the optimization term.
  • In this way, a rendered depth map of the object to be located is determined, optimization terms corresponding to each object pixel are then constructed based on the differences between the rendered depth map of the object to be located and the sample depth image, and the optimization terms are used to adjust the sample initial pose, making the adjusted sample initial pose more accurate.
  • the first point cloud and the second point cloud are obtained respectively, and the difference between the three-dimensional points in the first point cloud and the corresponding three-dimensional points in the second point cloud is combined with the normal direction of each point to obtain the deviation representation value corresponding to each object pixel point, making the determined deviation representation value more accurate.
  • After step S13 and before step S14, the following steps are also included:
  • in response to the optimized pose being a preset incorrectly estimated pose, the optimized pose is discarded, and the step of adjusting the network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose is not performed.
  • In this way, the disturbance caused by incorrectly estimated poses to the pose estimation model can be reduced, improving the accuracy of the final pose estimation model.
  • the way to determine whether the optimized pose is a preset wrong estimated pose can be:
  • the object pixels are pixels belonging to the object to be located in the sample color image
  • the deviation representation value corresponding to the object pixel is the product of the target pose difference corresponding to the object pixel and the normal direction corresponding to the object pixel.
  • the target pose difference is the pose difference between the first three-dimensional point corresponding to the object pixel point and the second three-dimensional point, where the first three-dimensional point is a three-dimensional point in the first point cloud corresponding to the rendered depth map of the object to be located, and the second three-dimensional point is a three-dimensional point in the second point cloud corresponding to the sample depth image of the object to be located.
  • the first point cloud is obtained by back-projecting the rendered depth image
  • the second point cloud is obtained by back-projecting the sample depth image.
  • the central tendency representation value is the average of the deviation representation values corresponding to each object pixel point.
  • the preset size is related to the size of the object to be positioned in the physical world. For example, the preset size can be 0.2 times the length of the object to be positioned.
  • the optimized pose is not a preset incorrectly estimated pose when the central tendency representation value is not larger than the preset size
  • the optimized pose can be filtered based on the physical size of the object to be located.
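  • The evaluation step can be sketched as a simple threshold test on the central tendency representation value, using the 0.2-times-object-length example given above; the tensor-based interface is an assumption.

```python
import torch

def is_wrong_pose(deviation_values, object_length, ratio=0.2):
    """Return True if the optimized pose should be discarded (illustrative).

    deviation_values: per object-pixel deviation representation values.
    object_length: physical length of the object to be located; ratio * length
        is the preset size, following the 0.2 example in the text.
    """
    central_tendency = deviation_values.abs().mean()      # average of the deviation values
    return bool(central_tendency > ratio * object_length)
```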
  • FIG. 3 is a schematic flowchart of the training method of the pose estimation model provided by the embodiment of the present disclosure.
  • the sample image data includes at least one sample color image and at least one sample depth image corresponding to the at least one sample color image
  • First, the pose estimation model is used to predict the sample initial pose of the object to be located in at least one sample color image; then, the depth information in at least one sample depth image is used to optimize the estimated sample initial pose, where the pose optimization method can be iterative optimization; next, the optimized pose is evaluated, that is, it is judged whether the optimized pose is a preset incorrect estimate, the optimized poses are filtered based on the evaluation results, and incorrect estimates are discarded; finally, the difference between the retained optimized pose and the sample initial pose is determined, and the network parameters of the pose estimation model are adjusted based on that difference.
  • the pose estimation model provided by embodiments of the present disclosure can estimate the 6D pose of the object to be located using a sample color image.
  • the training method of the pose estimation model provided by the embodiments of the present disclosure can be applied to the field of augmented reality applications.
  • FIG. 4 is a schematic flowchart of a pose estimation method provided by an embodiment of the present disclosure.
  • the pose estimation method provided by the embodiment of the present disclosure may include steps S21 to S23:
  • Step S21 Obtain a target image containing the object to be located.
  • the target image includes a target color image and a target depth image corresponding to the target color image.
  • the target image containing the object to be located may be captured by the execution device of the pose estimation method, or may be captured by other devices that establish communication connections with the execution device.
  • Step S22 Use the pose estimation model to process the target color image to obtain the target initial pose of the object to be located.
  • the method of obtaining the initial pose of the target may refer to the method of obtaining the initial pose of the sample in the above embodiment of the training method of the pose estimation model.
  • the pose estimation model is obtained using the training method provided by the above training method embodiment of the pose estimation model.
  • Step S23 Based on the depth information of the object to be located in the target depth image, the initial pose of the target is optimized to obtain the target pose of the object to be located.
  • the method of obtaining the target pose of the object to be located may refer to the method of obtaining the optimized pose in the above training method embodiment of the pose estimation model.
  • the pose estimation model is used to process the target color image to obtain the target initial pose of the object to be located, and then the target depth image is used to optimize the target initial pose.
  • the pose estimation process of the object to be located not only utilizes the color, texture, contour and other features in the target color image, but also utilizes the depth features in the target depth image, making the optimized target pose of the object to be located more accurate.
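  • Putting steps S21 to S23 together, an illustrative inference pipeline looks like the sketch below; the two callables are hypothetical stand-ins for a trained pose estimation model and for a depth-based refinement such as the one sketched earlier.

```python
def estimate_target_pose(pose_model, refine_with_depth, target_color, target_depth):
    """Steps S21 to S23 as a pipeline (illustrative only)."""
    init_pose = pose_model(target_color)                      # S22: target initial pose from color
    target_pose = refine_with_depth(init_pose, target_depth)  # S23: optimize with depth information
    return target_pose
```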
  • the pose estimation method provided by the embodiments of the present disclosure can be applied to the field of augmented reality applications.
  • the execution subject of the pose estimation method provided by the embodiments of the present disclosure may be a pose estimation device, and the pose estimation device may be any terminal capable of executing the pose estimation method provided by the embodiments of the present disclosure. equipment, servers or other processing equipment.
  • the terminal device can be an augmented reality display device, a visual positioning device, a user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld devices, computing devices, in-vehicle devices, wearable devices, etc.
  • the pose estimation method can be implemented by a processor calling computer-readable instructions stored in a memory.
  • each step does not imply a strict execution order and does not constitute any limitation on the implementation process.
  • the execution order of each step should be determined based on its function and possible internal logic.
  • FIG. 5 is a schematic structural diagram of a training device for a pose estimation model provided by an embodiment of the present disclosure.
  • the training device 50 of the pose estimation model includes a sample image acquisition part 51 , a sample pose estimation part 52 , a sample pose optimization part 53 and a parameter adjustment part 54 .
  • the sample image acquisition part 51 is configured to acquire a sample image containing the object to be located, and the sample image includes a sample color image and a sample depth image corresponding to the sample color image;
  • the sample pose estimation part 52 is configured to use the pose estimation model to process the sample color image to obtain the sample initial pose of the object to be located;
  • the sample pose optimization part 53 is configured to optimize the sample initial pose based on the depth information of the object to be located in the sample depth image to obtain the optimized pose of the object to be located;
  • the parameter adjustment part 54 is configured to adjust the network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose.
  • the pose estimation model is used to process the sample color image to obtain the sample initial pose of the object to be located, and then the sample depth image is used to optimize the sample initial pose.
  • the pose estimation process of the object to be located not only utilizes the color, texture, contour and other features in the sample color image, but also utilizes the depth features in the sample depth image, making the optimized pose of the object to be located more accurate.
  • the network parameters in the pose estimation model are adjusted without labeling the sample color image, which reduces the labeling workload and improves the training efficiency of the pose estimation model.
  • the sample pose optimization part 53 optimizing the sample initial pose based on the depth information of the object to be located in the sample depth image to obtain the optimized pose of the object to be located includes: determining a rendered depth map of the object to be located based on the sample initial pose and the preset three-dimensional model corresponding to the object to be located; using the difference between the rendered depth map and the sample depth image to determine the optimization term; and adjusting the sample initial pose so that the optimization term meets the preset requirements, and using the adjusted sample initial pose as the optimized pose.
  • the rendered depth map of the object to be located is determined based on the sample initial pose and the preset three-dimensional model corresponding to the object to be located, an optimization term is then constructed based on the difference between the rendered depth map and the sample depth image, and the optimization term is used to adjust the sample initial pose, making the adjusted sample initial pose more accurate.
  • the preset requirement is to minimize the optimization term; and/or, the sample pose optimization part 53 is also configured to determine the normal map of the object to be located based on the sample initial pose and the preset three-dimensional model; and the sample pose optimization part 53 using the difference between the rendered depth map and the sample depth image to determine the optimization term includes: back-projecting the rendered depth map and the sample depth image respectively to obtain a first point cloud corresponding to the rendered depth map and a second point cloud corresponding to the sample depth image, where the first point cloud includes a first three-dimensional point corresponding to at least one object pixel point, the second point cloud includes a second three-dimensional point corresponding to each object pixel point, and the object pixel points are the pixels belonging to the object to be located in the sample color image; for each object pixel point, determining the deviation representation value corresponding to the object pixel point, the deviation representation value being the product of the target pose difference corresponding to the object pixel point and the normal direction corresponding to the object pixel point in the normal map, where the target pose difference is the pose difference between the first three-dimensional point corresponding to the object pixel point and the corresponding second three-dimensional point; and determining the optimization term by combining the deviation representation values corresponding to each object pixel point.
  • In this way, the first point cloud and the second point cloud are obtained by back-projecting the rendered depth map and the sample depth image, and the difference between the three-dimensional points in the first point cloud and the three-dimensional points in the second point cloud is then combined with the normal direction of each point to determine the deviation representation value corresponding to each object pixel point, making the determined deviation representation value more accurate.
  • before the network parameters in the pose estimation model are adjusted based on the difference between the optimized pose and the sample initial pose, the parameter adjustment part 54 is also configured to: determine whether the optimized pose is a preset incorrectly estimated pose; and in response to the optimized pose not being a preset incorrectly estimated pose, perform the step of adjusting the network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose.
  • the parameter adjustment part 54 determining whether the optimized pose is a preset incorrectly estimated pose includes: obtaining a central tendency representation value of the deviation representation values corresponding to each object pixel point, where the object pixel points are the pixels belonging to the object to be located in the sample color image, the deviation representation value corresponding to an object pixel point is the product of the target pose difference corresponding to the object pixel point and the normal direction corresponding to the object pixel point, the target pose difference is the pose difference between the first three-dimensional point corresponding to the object pixel point and the corresponding second three-dimensional point, the first three-dimensional point is a three-dimensional point in the first point cloud corresponding to the rendered depth map, and the second three-dimensional point is a three-dimensional point in the second point cloud corresponding to the sample depth image; determining whether the central tendency representation value is less than or equal to the preset size, which is related to the size of the object to be located in the physical world; and determining that the optimized pose is not a preset incorrectly estimated pose in response to the central tendency representation value being less than or equal to the preset size.
  • the central tendency representation value is the average of the deviation representation values corresponding to each object pixel point.
  • the sample pose estimation part 52 using the pose estimation model to process the sample color image to obtain the sample initial pose of the object to be located includes: using the pose estimation model to determine the projection position of at least one three-dimensional key point of the object to be located on the sample color image; and determining the sample initial pose of the object to be located based on the projection position of each three-dimensional key point on the sample color image and the internal parameters of the target camera.
  • In this way, the projection positions of the three-dimensional key points of the object to be located on the sample color image can be determined, and the sample initial pose of the object to be located can then be obtained based on the determined projection positions of the three-dimensional key points and the internal parameters of the target camera.
  • the sample pose estimation part 52 using the pose estimation model to determine the projection position of at least one three-dimensional key point of the object to be located on the sample color image includes: using the pose estimation model to predict the direction vectors from each object pixel point to each projection position, where the object pixel points are the pixels belonging to the object to be located in the sample color image; for each projection position, determining a preset number of direction vectors from the at least one direction vector corresponding to the projection position, and generating candidate projection positions corresponding to each direction vector; determining the score of each candidate projection position based on the positional relationship between the candidate projection positions; and using the candidate projection position whose score meets the preset requirements as the projection position.
  • At least one candidate projection position is determined based on the direction vector of each object pixel point with respect to the projection position, and a candidate projection position that meets the requirements is then selected from the at least one candidate projection position as the final projection position, making the determined projection position more accurate.
  • the sample pose estimation part 52 determining a preset number of direction vectors from the at least one direction vector corresponding to the projection position and generating candidate projection positions corresponding to each direction vector includes: summing the position of each object pixel point with the direction vector corresponding to that object pixel point to obtain the candidate projection position corresponding to each object pixel point; determining the score of each candidate projection position based on the positional relationship between the candidate projection positions includes: for each candidate projection position, determining the number of target distances between the candidate projection position and other candidate projection positions and using the number of target distances as the score, where a target distance is a distance less than or equal to the preset distance; and using the candidate projection position whose score meets the preset requirements as the projection position includes: using the candidate projection position corresponding to the maximum score as the projection position.
  • the final projection position is determined by determining the distance between candidate projection positions, making the determined projection position more accurate.
  • the sample pose estimation part 52 using the pose estimation model to process the sample color image to obtain the sample initial pose of the object to be located includes: using the pose estimation model to perform target detection on the sample color image to obtain the position of the object to be located; cropping the sample color image based on the position of the object to be located to obtain a partial image containing the object to be located; and processing the partial image to obtain the sample initial pose of the object to be located.
  • In this way, target detection is first performed on the sample color image to obtain the position of the object to be located, the sample color image is then cropped to obtain a partial image containing the object to be located, and the sample initial pose of the object to be located is obtained by processing the partial image, which can reduce background interference and thereby improve the estimation accuracy of the sample initial pose.
  • FIG. 6 is a schematic structural diagram of a posture estimation device provided by an embodiment of the present disclosure.
  • the pose estimation device 60 includes a target image acquisition part 61 , a target pose estimation part 62 and a target pose optimization part 63 .
  • the target image acquisition part 61 is configured to acquire a target image containing the object to be located, where the target image includes a target color image and a target depth image corresponding to the target color image;
  • the target pose estimation part 62 is configured to use the pose estimation model to process the target color image to obtain the target initial pose of the object to be located;
  • the target pose optimization part 63 is configured to optimize the target initial pose based on the depth information of the object to be located in the target depth image to obtain the target pose of the object to be located; wherein the pose estimation model is obtained by training with the training device provided in the above embodiment of the training device for the pose estimation model.
  • the pose estimation model is used to process the target color image to obtain the target initial pose of the object to be located, and then the target depth image is used to optimize the target initial pose.
  • the pose estimation process of the object to be located not only utilizes the color, texture, contour and other features in the target color image, but also utilizes the depth features in the target depth image, making the optimized target pose of the object to be located more accurate.
  • FIG. 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • the electronic device 70 includes a memory 71 and a processor 72 coupled to each other.
  • the processor 72 is configured to execute computer program instructions stored in the memory 71 to implement the steps of any of the above training method embodiments of the pose estimation model, or to implement the steps in any of the above pose estimation method embodiments.
  • the electronic device 70 may include but is not limited to: a microcomputer and a server.
  • the electronic device 70 may also include mobile devices such as laptop computers and tablet computers, which are not limited here.
  • the processor 72 is configured to control itself and the memory 71 to implement the steps in any of the above training method embodiments of the pose estimation model, or to implement the steps in any of the above pose estimation method embodiments.
  • the processor 72 may also be called a CPU (Central Processing Unit).
  • the processor 72 may be an integrated circuit chip with signal processing capabilities.
  • the processor 72 can also be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the processor 72 may be implemented by an integrated circuit chip.
  • the pose estimation model is used to process the sample color image to obtain the sample initial pose of the object to be located, and then the sample depth image is used to optimize the sample initial pose.
  • the pose estimation process of the object to be located not only utilizes the color, texture, contour and other features in the sample color image, but also utilizes the depth features in the sample depth image, making the optimized pose of the object to be located more accurate.
  • the difference between the optimized pose and the sample initial pose is used to adjust the network parameters in the pose estimation model. There is no need to label the sample color image, which reduces the labeling workload and improves the training efficiency of the pose estimation model.
  • FIG. 8 is a schematic structural diagram of a computer-readable storage medium provided by an embodiment of the present disclosure.
  • the computer-readable storage medium 80 stores computer program instructions 801 that can be run by the processor.
  • the program instructions 801 are used to implement the steps in any of the above training method embodiments of the pose estimation model, or to implement the steps in any of the above pose estimation method embodiments.
  • the pose estimation model is used to process the sample color image to obtain the sample initial pose of the object to be located, and then the sample depth image is used to optimize the sample initial pose.
  • the pose estimation process of the object to be located not only utilizes the color, texture, contour and other features in the sample color image, but also utilizes the depth features in the sample depth image, making the optimized pose of the object to be located more accurate.
  • the difference between the optimized pose and the sample initial pose is used to adjust the network parameters in the pose estimation model. There is no need to label the sample color image, which reduces the labeling workload and improves the training efficiency of the pose estimation model.
  • The functions or modules included in the apparatus provided by the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for implementation details, reference may be made to the description of the above method embodiments.
  • The disclosed methods and apparatuses may be implemented in other ways.
  • The apparatus implementations described above are only illustrative.
  • The division into modules or units is only a division by logical function; in actual implementation there may be other ways of division.
  • For example, units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • The mutual coupling, direct coupling or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical or in other forms.
  • Each functional unit in the various embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.
  • The above integrated units may be implemented in the form of hardware or in the form of software functional units.
  • If the integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • Based on such an understanding, the technical solutions of the embodiments of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product.
  • The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the various embodiments of the present disclosure.
  • The aforementioned storage media include: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media that can store program code.
  • Where the disclosed technical solution involves personal information, products applying the disclosed technical solution will clearly inform users of the personal information processing rules and obtain the individuals' independent consent before processing personal information.
  • Where the disclosed technical solution involves sensitive personal information, products applying the disclosed technical solution will obtain the individuals' separate consent before processing the sensitive personal information, and will at the same time meet the requirement of "express consent", for example by setting up clear and conspicuous signs at personal information collection devices such as cameras, informing individuals that they have entered the scope of personal information collection and that personal information will be collected.
  • The personal information processing rules may include information such as the personal information processor, the purposes of processing, the processing methods, and the types of personal information processed.
  • Embodiments of the present disclosure provide a pose estimation method and a training method, apparatus, electronic device, computer-readable medium and computer program product for a related model, wherein the training method for the pose estimation model includes: acquiring a sample image containing an object to be located, the sample image containing a sample color image and a sample depth image corresponding to the sample color image; processing the sample color image with the pose estimation model to obtain a sample initial pose of the object to be located; optimizing the sample initial pose based on depth information of the object to be located in the sample depth image to obtain an optimized pose of the object to be located; and adjusting network parameters in the pose estimation model based on the difference between the optimized pose and the sample initial pose.
  • In this way, the pose estimation model is used to process the sample color image to obtain the sample initial pose of the object to be located, and the sample depth image is then used to optimize the sample initial pose, so that the pose estimation process for the object to be located utilizes both features such as color, texture and contour in the sample color image and the depth features in the sample depth image, making the optimized pose of the object to be located more accurate.
  • The difference between the optimized pose and the sample initial pose is used to adjust the network parameters in the pose estimation model, so there is no need to label the sample color image, which reduces the labeling workload and improves the training efficiency of the pose estimation model; a minimal sketch of this training loop is given after this list.
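
A minimal sketch of this self-supervised training loop, assuming a PyTorch-style `pose_model` together with a depth-based refinement routine `refine_with_depth` and a filter `is_bad_estimate`; these names are illustrative assumptions rather than part of the disclosure, and the refined pose is treated as a pseudo label so that no manual annotation of the sample color image is needed.

```python
import torch

def train_step(pose_model, refine_with_depth, is_bad_estimate, optimizer, batch):
    """One self-supervised training step (illustrative names, not from the disclosure).

    batch["color"]: sample color images; batch["depth"]: aligned sample depth images.
    """
    color, depth = batch["color"], batch["depth"]

    # 1) Predict the sample initial pose from the color image only.
    init_pose = pose_model(color)                 # e.g. rotation + translation per object

    # 2) Refine the initial pose against the depth image; treated as a pseudo label,
    #    so no gradients flow through the refinement.
    with torch.no_grad():
        refined_pose = refine_with_depth(init_pose.detach(), depth)

    # 3) Discard refinements judged to be wrong estimates (see the filtering step).
    if is_bad_estimate(refined_pose, init_pose, depth):
        return None

    # 4) The training signal is the difference between refined and initial pose;
    #    no manual pose annotation of the color image is required.
    loss = torch.nn.functional.mse_loss(init_pose, refined_pose)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```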

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

本公开公开了一种位姿估计方法及相关模型的训练方法、装置、电子设备、计算机可读存储介质和计算机程序产品,其中,位姿估计模型的训练方法,包括:获取包含待定位对象的样本图像,样本图像包含样本彩色图像和样本彩色图像对应的样本深度图像;利用位姿估计模型对样本彩色图像处理,得到待定位对象的样本初始位姿;基于样本深度图像中待定位对象的深度信息,对样本初始位姿进行优化,得到待定位对象的优化位姿;基于优化位姿与样本初始位姿之间的差异,调整位姿估计模型中的网络参数。

Description

位姿估计方法及相关模型的训练方法、装置、电子设备、计算机可读介质和计算机程序产品
相关申请的交叉引用
本公开实施例基于申请号为202210823003.X、申请日为2022年07月12日、申请名称为“位姿估计方法及相关模型的训练方法、装置、设备、介质”的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本公开作为参考。
技术领域
本公开涉及但不限于人工智能技术领域,特别是涉及一种位姿估计方法及相关模型的训练方法、装置、电子设备、计算机可读介质和计算机程序产品。
背景技术
随着科技的发展,可以使用相机对需要定位的对象拍摄图像,然后利用网络模型对拍摄得到的图像进行处理,得到该对象的位姿。
发明内容
本公开实施例至少提供一种位姿估计方法及相关模型的训练方法、装置、电子设备、计算机可读介质和计算机程序产品。
本公开实施例提供了一种位姿估计模型的训练方法,包括:获取包含待定位对象的样本图像,样本图像包含样本彩色图像和样本彩色图像对应的样本深度图像;利用位姿估计模型对样本彩色图像处理,得到待定位对象的样本初始位姿;基于样本深度图像中待定位对象的深度信息,对样本初始位姿进行优化,得到待定位对象的优化位姿;基于优化位姿与样本初始位姿之间的差异,调整位姿估计模型中的网络参数。
根据本公开实施例的位姿估计模型的训练方法,利用位姿估计模型对样本彩色图像进行处理,得到待定位对象的样本初始位姿之后,再利用样本深度图像对样本初始位姿进行优化,这样,对待定位对象的位姿估计过程既利用了样本彩色图像中颜色、纹理、轮廓等特征,又利用了样本深度图像中的深度特征,使得优化后的待定位对象的优化位姿更为准确。并且,利用优化位姿和样本位姿之间的差异,调整位姿估计模型中的网络参数,无需对样本彩色图像进行标注,减少了标注工作量,提高了对位姿估计模型的训练效率。
本公开实施例还提供了一种位姿估计方法,包括:获取包含待定位对象的目标图像,目标图像包括目标彩色图像和目标彩色图像对应的目标深度图像;利用位姿估计模型对目标彩色图像进行处理,得到待定位对象的目标初始位姿;基于目标深度图像中待定位对象的深度信息,对目标初始位姿进行优化,得到待定位对象的目标位姿;其中,图像检测模型是利用上述位姿估计模型的训练方法训练得到的。
本公开实施例还提供了一种位姿估计模型的训练装置,包括:样本图像获取部分,被配置为获取包含待定位对象的样本图像,样本图像包含样本彩色图像和样本彩色图像对应的样本深度图像;样本位姿估计部分,被配置为利用位姿估计模型对样本彩色图像处理,得到待定位对象的样本初始位姿;样本位姿优化部分,被配置为基于样本深度图像中待定位对象的深度信息,对样本初始位姿进行优化,得到待定位对象的优化位姿;参数调整部分,被配置为基于优化位姿与样本初始位姿之间的差异,调整位姿估计模型中的网络参数。
本公开实施例还提供了一种位姿估计装置,包括:目标图像获取部分,被配置为获取包含待定位对象的目标图像,目标图像包括目标彩色图像和目标彩色图像对应的目标深度图像;目标位姿估计部分,被配置为利用位姿估计模型对目标彩色图像进行处理,得到待定位对象的目标初始位姿;目标位姿优化部分,被配置为基于目标深度图像中待定位对象的深度信息,对目标初始位姿进行优化,得到待定位对象的目标位姿;其中,图像检测模型是利用上述位姿估计模型的训练装置训练得到的。
本公开实施例还提供了一种电子设备,包括相互耦接的存储器和处理器,处理器被配置为执行存储器中存储的程序指令,以实现上述位姿估计模型的训练方法,或实现上述图像检测方法。
本公开实施例还提供了一种计算机可读存储介质,其上存储有程序指令,程序指令被处理器执行时实现上述位姿估计模型的训练方法,或实现上述图像检测方法。
本公开实施例还提供一种计算机程序产品,所述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,所述计算机程序被计算机读取并执行时,实现上述方法中的部分或全部步骤。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,而非限制本公开实施例。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。
图1是本公开实施例提供的一种位姿估计模型的训练方法的流程示意图;
图2是图1所示的流程示意图中的步骤S13的子流程示意图;
图3是本公开实施例提供的一种位姿估计模型的训练方法的流程示意图;
图4是本公开实施例提供的一种位姿估计方法的流程示意图;
图5是本公开实施例提供的一种位姿估计模型的训练装置的组成结构示意图;
图6是本公开实施例提供的一种位姿估计装置的组成结构示意图;
图7是本公开实施例提供的一种电子设备的框图;
图8为本公开实施例提供的一种计算机可读存储介质的框图。
具体实施方式
下面结合说明书附图,对本公开实施例的方案进行详细说明。
以下描述中,为了说明而不是为了限定,提出了诸如特定系统结构、接口、技术之类的具体细节,以便透彻理解本公开实施例。
本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。此外,本文中的“多”表示两个或者多于两个。另外,本文中术语“至少一种”表示多种中的任意一种或多种中的至少两种的任意组合,例如,包括A、B、C中的至少一种,可以表示包括从A、B和C构成的集合中选择的任意一个或多个元素。
相关技术中,为了获得用于定位图像中的对象的位姿的网络模型,需要使用大量带有标注的标签的样本图像对网络模型进行训练,而对样本图像标注标签的过程需要耗费巨大的工作量和时间,导致整个网络模型训练过程用时较长、训练效率较低。
基于此,本公开实施例提供了一种位姿估计模型的训练方法,该训练方法的执行主体可以是位姿估计模型的训练装置,该位姿估计模型的训练装置可以是任意一种能够执行本公开实施例的方法的终端设备或服务器或其它处理设备,其中,终端设备可以为视觉定位设备、用户设备(User Equipment,UE)、移动设备、用户终端、终端、蜂窝电话、无绳电话、个人数字处理(Personal Digital Assistant,PDA)、手持设备、计算设备、车载设备、可穿戴设备等。在一些可能的实现方式中,该位姿估计模型的训练方法可以通过处理器调用存储器中存储的计算机可读指令的方式来实现。
请参阅图1,图1是本公开实施例提供的位姿估计模型的训练方法的流程示意图,该方法可以包括步骤S11至步骤S14:
步骤S11,获取包含待定位对象的样本图像,样本图像包含样本彩色图像和样本彩色图像对应的样本深度图像。
其中,样本图像可以是真实图像,也可以是合成图像。一些应用场景中,样本图像可以包括部分真实图像以及部分合成图像。在样本图像为真实图像的情况下,获取包含待定位对象的样本图像的方式可以是由执行本公开实施例提供的位姿估计模型的训练方法的执行设备对待定位对象进行拍摄,或者,由其他设备对待定位对象进行拍摄之后,将拍摄得到的图像通过通信连接的方式传输至执行设备。一些实施例中,可以使用公开的用于进行位姿估计的图像数据集作为样本图像。
另外,样本深度图像中各像素点的像素值用于表示样本彩色图像中对应像素点的深度值。其中,该深度值可以是该像素点对应的被拍摄对象上的三维点与拍摄设备之间的距离。
步骤S12,利用位姿估计模型对样本彩色图像处理,得到待定位对象的样本初始位姿。
其中,该位姿估计模型可以是经过预训练的模型,也可以是未经过预训练的模型。位姿估计模型可以是残差网络(ResNet网络),还可以是其他任意结构的网络。其中,可以由位姿估计模型直接输出待定位对象的样本初始位姿,还可以由位姿估计模型输出中间结果,然后由其他模型或网络等对该中间结果进行进一步处理,得到待定位对象的样本初始位姿。
其中,样本初始位姿可以是六自由度位姿,即,样本初始位姿包含待定位对象在相机坐标系下的位置和朝向。
步骤S13,基于样本深度图像中待定位对象的深度信息,对样本初始位姿进行优化,得到待定位对象的优化位姿。
如上述,样本深度图像中各像素点的像素值用于表示样本彩色图像中对应像素点的深度值,其中,该深度值可以是该像素点对应的被拍摄对象上的三维点与拍摄设备之间的距离。这里,因为样本彩色图像为二维图像,仅能反映待定位对象的颜色、纹理等特征,而无法很好地反映待定位对象与拍摄设备之间的距离,导致由样本彩色图像得到的样本初始位姿可能不太准确,因此,通过结合样本深度图像中待定位对象的深度信息,对样本初始位姿进行优化,使得优化后的优化位姿能够反映与待定位对象相关的更多信息、更为准确。
步骤S14,基于优化位姿与样本初始位姿之间的差异,调整位姿估计模型中的网络参数。
在一些实现方式中,可以基于优化位姿与样本初始位姿之间的差异,确定损失,然后利用该损失调整位姿估计模型中的网络参数。
上述方案中,利用位姿估计模型对样本彩色图像进行处理,得到待定位对象的样本初始位姿之后,再利用样本深度图像对样本初始位姿进行优化,这样,对待定位对象的位姿估计过程既利用了样本彩色图像中颜色、纹理、轮廓等特征,又利用 了样本深度图像中的深度特征,使得优化后的待定位对象的优化位姿更为准确。并且,使用优化位姿和样本初始位姿之间的差异,即可实现对调整位姿估计模型中的网络参数的优化,而无需对样本彩色图像的位姿进行标注,并利用样本彩色图像的标注位姿与样本初始位姿之间差异来调整位姿估计模型中的网络参数,从而减少了标注工作量,提高了位姿估计模型的训练效率。
一些公开实施例中,上述步骤S12可以包括以下步骤S121至步骤S123:
步骤S121,利用位姿估计模型对样本彩色图像进行目标检测,得到待定位对象的位置。示例性地,位姿估计模型中包含目标检测子网络,目标检测子网络被配置为对样本彩色图像进行目标检测,得到待定位对象在样本彩色图像中的位置。在一些实施例中,目标检测子网络与位姿估计模型也可以是相互独立的,即,先使用目标检测子网络对样本彩色图像进行目标检测,得到待定位对象的位置,然后,使用位姿估计模型基于目标检测子网络的检测结果对样本彩色图像进行处理。
步骤S122,基于待定位对象在样本彩色图像中的位置,对样本彩色图像进行裁剪,得到包含待定位对象的局部图像。示例性地,裁剪的方式可以是以待定位对象在样本彩色图像上所处的区域向外扩展预设的尺度后进行裁剪,将裁剪得到的包含待定位对象的部分图像作为该局部图像。
步骤S123,对局部图像进行处理,得到待定位对象的样本初始位姿。
示例性地,合成图像的背景和前景可能存在较大的差异,若直接对合成图像进行处理,合成图像中面积较大的背景部分可能会影响对待定位对象所在的前景部分的检测,导致估计的样本初始位姿的准确度较低。本公开实施例中,先通过目标检测子网络对样本彩色图像进行目标检测,得到待定位对象的位置,然后,对样本彩色图像进行裁剪,得到包含待定位对象的局部图像,最后,通过对局部图像进行处理得到待定位对象的样本初始位姿,这样,可以减少所处理的局部图像中的背景部分,从而降低合成图像中的背景对处理结果的干扰,提高样本初始位姿的估计准确度。
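A hedged sketch of this detect-then-crop step (steps S121 to S123): the detected box is expanded outward by a preset scale and then cropped, so that the local image fed to the model contains mostly the object rather than the (possibly synthetic) background. The expansion ratio and helper names below are assumptions for illustration.

```python
import numpy as np

def crop_object_region(color_image, bbox, expand_ratio=1.2):
    """Crop a local image around a detected object.

    color_image: HxWx3 array; bbox: (x_min, y_min, x_max, y_max) from the detector;
    expand_ratio: preset outward expansion of the box (an assumed value).
    """
    h, w = color_image.shape[:2]
    x_min, y_min, x_max, y_max = bbox
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    half_w = (x_max - x_min) * expand_ratio / 2.0
    half_h = (y_max - y_min) * expand_ratio / 2.0

    # Clamp the expanded box to the image bounds before cropping.
    x0 = int(max(0, np.floor(cx - half_w)))
    y0 = int(max(0, np.floor(cy - half_h)))
    x1 = int(min(w, np.ceil(cx + half_w)))
    y1 = int(min(h, np.ceil(cy + half_h)))
    return color_image[y0:y1, x0:x1], (x0, y0)
```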
在一些公开实施例中,上述步骤S12还可以包括以下步骤S124至步骤S125:
步骤S124,利用位姿估计模型,确定待定位对象的至少一个三维关键点在样本彩色图像上的投影位置。如上述,可以对样本彩色图像进行裁剪得到包含待定位对象的局部图像,确定待定位对象的至少一个三维关键点在样本彩色图像上的投影位置可以是确定各三维关键点在该局部图像上的投影位置。其中,待定位对象的至少一个三维关键点可以是在待定位对象对应的预设三维模型上提取的。示例性地,至少一个三维关键点可以是通过最远点采样算法从预设三维模型上获取的三维点集。
步骤S125,基于各三维关键点在样本彩色图像上的投影位置以及目标相机的内参,确定待定位对象的样本初始位姿。其中,在确定各三维关键点在局部图像上的 投影位置之后,结合目标相机的内参,可以通过解决PnP(Perspective-n-Point)问题的方式,确定待定位对象的样本初始位姿。示例性地,目标相机的内参可以包括焦距等参数。在一些实施例中,使用解决PnP问题的方式获取待定位对象的样本初始位姿的方式此处不做过多叙述。此外,目标相机可以是采集样本图像的相机,同上述拍摄设备。
在一些实现方式中,可以通过位姿估计模型确定待定位对象的三维关键点在样本彩色图像上的投影位置,然后,根据确定的三维关键点的投影位置以及目标相机的内部参数,得到待定位对象的样本初始位姿。
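Step S125 amounts to solving a PnP problem from the 2D projection positions of the 3D keypoints and the intrinsics of the target camera. The sketch below uses OpenCV's `solvePnP` as one possible solver; the disclosure itself does not prescribe a particular PnP implementation, so this is only an assumed choice.

```python
import cv2
import numpy as np

def estimate_initial_pose(keypoints_3d, projections_2d, camera_matrix):
    """Solve a PnP problem from 3D model keypoints and their predicted 2D projections.

    keypoints_3d: (K, 3) points on the preset 3D model, in the model frame.
    projections_2d: (K, 2) predicted projection positions in the (cropped) color image.
    camera_matrix: 3x3 intrinsics of the target camera.
    """
    dist_coeffs = np.zeros(4)  # assume an undistorted image for this sketch
    ok, rvec, tvec = cv2.solvePnP(
        keypoints_3d.astype(np.float64),
        projections_2d.astype(np.float64),
        camera_matrix.astype(np.float64),
        dist_coeffs,
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        return None
    rotation, _ = cv2.Rodrigues(rvec)      # 3x3 rotation of the object in the camera frame
    return rotation, tvec.reshape(3)       # 6-DoF sample initial pose
```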
在一些实现方式中,上述步骤S124可以包括步骤S1241至步骤S1243:
步骤S1241,利用位姿估计模型,预测各对象像素点分别到每一投影位置的方向向量。
其中,对象像素点为样本彩色图像中属于待定位对象的像素点。如上述,位姿估计模型可以对样本彩色图像进行目标检测,得到待定位对象在样本彩色图像中的位置。其中,将属于待定位对象的像素点的语义标签设置为第一预设值(例如,第一预设值可以是1),将不属于待定位对象的像素点的语义标签设置为第二预设值(例如,第二预设值可以0),并且将语义标签为第一预设值的像素点作为对象像素点。
方向向量可以是二维向量,其中,一个维度是样本彩色图像的x轴方向上的向量(例如,样本彩色图像的横轴方向上的向量),另一个维度可以是样本彩色图像的y轴方向上的向量(例如,样本彩色图像的纵轴方向上的向量)。
示例性地,对象像素点到投影位置的方向向量可参考公式(1):
$v_k(p) = x_k - p \quad (1)$
其中，$v_k(p)$ 表示对象像素点 $p$ 到第 $k$ 个投影位置的方向向量，$x_k$ 表示第 $k$ 个投影位置的二维坐标，$p$ 表示像素点 $p$ 的位置的二维坐标。
对于每一投影位置,从与该投影位置对应的至少一个方向向量中确定预设数量的方向向量,生成各方向向量对应的候选投影位置。示例性地,将各对象像素点的位置与对象像素点对应的方向向量求和,得到各对象像素点对应的候选投影位置。示例性地,若样本彩色图像中属于待定位对象的对象像素点有10个,预设数量为5,即可以从10个方向向量中选择5个方向向量,每一方向向量有对应的对象像素点,每一对象像素点与其对应的方向向量相加,即可得到该对象像素点对应的候选投影位置,即得到的候选投影位置为5个。
一些应用场景中,位姿估计模型的输入为至少一张样本彩色图像,输出结果为各像素点对应的语义标签以及各像素点对应的方向向量。其中,语义标签用于表示该像素点是否属于待定位对象。一些应用场景中,输入的若干张样本彩色图像中可 以包括多种待定位对象,输出的各像素点对应的语义标签可以是该像素点属于哪种待定位对象的标签。示例性地,多种待定位对象可以包括杯子、桌子、凳子等。也即,根据本公开实施例提供的位姿估计模型的训练方法得到的位姿估计模型能够同时对多种待定位对象进行位姿估计,得到各待定位对象的目标位姿。
步骤S1242,基于方向向量与对应的对象像素点,确定对应的候选投影位置的方式可参考公式(2):
$h_{k,i} = p + v_k(p) \quad (2)$
其中，$\{h_{k,i} \mid i = 1, 2, \ldots, N\}$ 为候选投影位置的集合，$N$ 是候选投影位置的数量，$p$ 表示对象像素点，$v_k(p)$ 表示对象像素点 $p$ 到第 $k$ 个投影位置的方向向量。
步骤S1243,基于各候选投影位置之间的位置关系,确定各候选投影位置的分数,将分数满足预设要求的候选投影位置,作为投影位置。
可选地,上述基于各候选投影位置之间的位置关系,确定各候选投影位置的分数的方式可以是:对于每一候选投影位置,确定该候选投影位置与其他候选投影位置之间的目标距离的数量,并将目标距离的数量作为当前候选投影位置的分数。其中,目标距离为小于或等于预设距离的距离。示例性地,将当前候选投影位置与其他候选投影位置作差,得到当前候选投影位置和其他候选投影位置对应的距离。其中,预设距离的大小可以在对位姿估计模型的训练过程中进行调整,以确定最终的预设距离。
示例性地,计算各候选投影位置的分数的方式可参考公式(3):
$w_{k,i} = \sum_{p} \mathbb{I}\left( \left\| h_{k,i} - p - v_k(p) \right\| \le \theta \right) \quad (3)$
其中，$w_{k,i}$ 表示第 $i$ 个候选投影位置的分数；$\mathbb{I}$ 是指示函数，满足条件为 1，不满足条件则为 0；$\theta$ 为预设距离，例如 $\theta$ 可以取值为 1。
其中,上述将分数满足预设要求的候选投影位置,作为投影位置的方式可以是:将最大分数对应的候选投影位置作为投影位置。其中,其他三维关键点对应的投影位置的确定方式可参考上述过程。
这样,通过确定各候选投影位置之间的距离,确定得到最终的投影位置,使确定的投影位置更为准确。
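A sketch of the voting procedure of steps S1241 to S1243 for a single keypoint, assuming the network has already predicted, for every object pixel $p$, a direction vector $v_k(p)$ toward the $k$-th projection position: a preset number of pixel hypotheses $h = p + v_k(p)$ serve as candidates, each candidate is scored by counting hypotheses within the preset distance $\theta$ (formula (3)), and the highest-scoring candidate is kept as the projection position. The hyper-parameter values are illustrative.

```python
import numpy as np

def vote_projection(pixel_coords, directions, num_candidates=5, theta=1.0, seed=0):
    """Voting for one keypoint's projection position (illustrative hyper-parameters).

    pixel_coords: (M, 2) positions p of object pixels in the color image.
    directions:   (M, 2) predicted direction vectors v_k(p) toward the k-th projection.
    """
    rng = np.random.default_rng(seed)
    hypotheses = pixel_coords + directions              # each pixel's guess p + v_k(p)

    # Draw a preset number of pixels; their guesses become the candidate projections.
    idx = rng.choice(len(hypotheses), size=min(num_candidates, len(hypotheses)),
                     replace=False)
    candidates = hypotheses[idx]                        # (N, 2) candidate positions h_{k,i}

    # Score each candidate: how many pixel guesses lie within the preset distance theta.
    dists = np.linalg.norm(candidates[:, None, :] - hypotheses[None, :, :], axis=-1)
    scores = (dists <= theta).sum(axis=1)

    # Keep the candidate with the highest score as the projection position.
    return candidates[int(np.argmax(scores))]
```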
一些应用场景中,本公开实施例提供的位姿估计模型的训练方法还可包括对位姿估计模型的预训练步骤。
其中,预训练的步骤可以是:获取若干样本图像,该样本图像与上述步骤S11获取的样本图像可以相同,也可以不同。获取样本图像上各像素点的样本语义标签以及样本投影位置。确定基于位姿估计模型输出的语义标签与样本语义标签之间的第一损失,以及基于位姿估计模型输出的各对象像素点的方向向量确定投影位置,确定投影位置与样本投影位置之间的第二损失,基于第一损失和第二损失,调整位 姿估计模型中的网络参数。
这里,在预训练中,将初始学习率可以设置为1e-3,每隔第一预定迭代次数之后,学习率减半。在预训练之后,学习率可以调整为5e-4,每隔第二预定迭代次数之后,学习率减半。可选地,第一预定迭代次数为第二预定迭代次数的两倍。
这样,本公开实施例中,首先,基于各对象像素点关于投影位置的方向向量确定至少一个候选投影位置,然后从至少一个候选投影位置中选出满足要求的候选投影位置,作为最终的投影位置,使得到的投影位置更为准确。
参见图2,图2为图1中所示的位姿估计模型的训练方法中的步骤S13的子流程示意图。如图2所述,上述步骤S13可以包括步骤S131至步骤S133:
步骤S131,基于样本初始位姿以及待定位对象对应的预设三维模型,确定关于待定位对象的渲染深度图。
其中,预设三维模型可以是利用绘图软件绘制得到的,或利用建模网络使用至少一张包含待定位对象的图像对待定位对象进行三维建模得到的。
其中,样本初始位姿可以认为是待定位对象对应的预设三维模型在相机坐标系下的位姿。上述基于样本初始位姿以及待定位对象对应的预设三维模型,确定关于待定位对象的渲染深度图的方式可以是基于样本初始位姿,将该预设三维模型投影到相机平面上,得到该渲染深度图。
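A simplified sketch of step S131: given the sample initial pose (rotation and translation of the preset 3D model in the camera frame) and the camera intrinsics, the model is projected onto the camera plane to form a rendered depth map. A real implementation would rasterize the mesh; the point-splatting z-buffer below is only an approximation assumed for illustration.

```python
import numpy as np

def render_depth(model_points, rotation, translation, camera_matrix, image_size):
    """Approximate a rendered depth map by projecting sampled model points with a z-buffer.

    model_points: (M, 3) points sampled from the preset 3D model (a mesh rasterizer would
    normally be used; point splatting is only a simplification for this sketch).
    """
    h, w = image_size
    depth = np.full((h, w), np.inf)

    pts_cam = model_points @ rotation.T + translation      # model frame -> camera frame
    pts_cam = pts_cam[pts_cam[:, 2] > 0]                   # keep points in front of camera
    uvz = pts_cam @ camera_matrix.T                        # perspective projection
    u = (uvz[:, 0] / uvz[:, 2]).astype(int)
    v = (uvz[:, 1] / uvz[:, 2]).astype(int)
    z = pts_cam[:, 2]

    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    for ui, vi, zi in zip(u[inside], v[inside], z[inside]):
        depth[vi, ui] = min(depth[vi, ui], zi)             # keep the nearest surface
    depth[np.isinf(depth)] = 0.0                           # 0 marks background pixels
    return depth
```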
步骤S132,利用渲染深度图和样本深度图像之间的差异,确定优化项。
其中,位姿估计模型的训练方法还可包括以下步骤:基于样本初始位姿以及预设三维模型,确定关于待定位对象的法线图。示例性地,预设三维模型可以是由至少一个平面(例如,三角形网格面)构成,每一平面在法线图中对应一个像素点的像素值,该像素值可以用于表示该平面的法线方向。
其中,利用渲染深度图和样本深度图像之间的差异,确定优化项的方式可以是:
分别对渲染深度图和样本深度图像进行反投影,得到渲染深度图对应的第一点云和样本深度图像对应的第二点云。其中,第一点云包括至少一个对象像素点对应的第一三维点,第二点云包括各对象像素点对应的第二三维点。如上述,对象像素点为样本彩色图像中属于待定位对象的像素点。在一些实施例中,使用待定位对象的样本初始位姿对该渲染深度图进行反投影,得到第一点云;使用待定位对象的样本初始位姿对该样本深度图像进行反投影,得到第二点云。
然后,对于每一对象像素点,确定对象像素点对应的偏差表征值。偏差表征值可以是残差。其中,对象像素点对应的偏差表征值为对象像素点对应的目标位姿差与对象像素点在法线图中对应的法线方向之间的乘积。其中,目标位姿差为对象像素点对应的第一三维点和对应的第二三维点之间的位姿差。
其中,对于对象像素点p,确定其偏差表征值L(p)的方式可以参考公式(4):
$L(p) = \left\| \left( \pi^{-1}(D_r(p)) - \pi^{-1}(D(p)) \right) \cdot N_r(p) \right\|^2 \quad (4)$
其中，$\pi^{-1}$ 是反投影函数，$D_r(p)$ 表示对象像素点 $p$ 在渲染深度图中的深度值，$D(p)$ 表示对象像素点 $p$ 在样本深度图中的深度值，$N_r(p)$ 表示对象像素点 $p$ 在法线图中对应的法线方向。$\pi^{-1}(D_r(p)) - \pi^{-1}(D(p))$ 表示第一三维点和对应的第二三维点之间的位姿差。
对于对象像素点的偏差表征值,可以是从第二三维点到第一三维点的平面的最小距离,该平面是由第一三维点及其法线定义的。
接着,结合各对象像素点对应的偏差表征值,确定优化项。示例性地,将各偏差表征值之和、各偏差表征值的平均值或各偏差表征值中的最大值作为优化项。利用梯度下降法最小化这些残差项以得到期望的优化位姿。一些应用场景中,为避免优化结果收敛到局部最小值,可以通过扰动样本初始位姿生成一组假设的样本初始位姿。然后,对这些姿态进行优化,得到更准确的优化位姿。
步骤S133,调整样本初始位姿,以使优化项满足预设要求,并将调整后的样本初始位姿作为优化位姿。
在一些实现方式中,预设要求可以是优化项最小化。
在本公开的实施例中,基于样本初始位姿以及待定位对象对应的预设三维模型,确定关于待定位对象的渲染深度图,然后基于待定位对象的渲染深度图和样本深度图像之间的差异,构建各对象像素点对应的优化项,利用优化项对样本初始位姿进行调整,使得调整后的样本初始位姿更为准确。
另外,通过将待定位对象的样本初始位姿反投影到渲染深度图和样本深度图像,分别得到第一点云和第二点云,以及基于第一点云中的三维点与第二点云中的三维点之间的差异,结合各点的法线方向,得到对象像素点对应的偏差表征值,使得确定的偏差表征值更为准确。
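A numpy sketch of the optimization term of steps S131 and S132: both depth maps are back-projected with $\pi^{-1}$ to obtain the first and second point clouds, and for each object pixel the deviation value $L(p)$ of formula (4) is the squared projection of the point difference onto the rendered normal direction $N_r(p)$, i.e. a point-to-plane residual. Function and argument names are assumptions, not part of the disclosure.

```python
import numpy as np

def backproject(depth, camera_matrix):
    """Inverse projection pi^{-1}: lift every pixel of a depth map to a 3D point."""
    h, w = depth.shape
    fx, fy = camera_matrix[0, 0], camera_matrix[1, 1]
    cx, cy = camera_matrix[0, 2], camera_matrix[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)                # (H, W, 3) point cloud

def point_to_plane_residuals(rendered_depth, observed_depth, rendered_normals,
                             object_mask, camera_matrix):
    """Residual L(p) for each object pixel, following formula (4)."""
    first_cloud = backproject(rendered_depth, camera_matrix)   # from the rendered depth map
    second_cloud = backproject(observed_depth, camera_matrix)  # from the sample depth image

    diff = first_cloud - second_cloud                          # per-pixel point difference
    residuals = (diff * rendered_normals).sum(axis=-1) ** 2    # projection onto the normal
    return residuals[object_mask]                              # keep object pixels only
```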
一些公开实施例中,在步骤S13之后、步骤S14之前,还包括以下步骤:
判断优化位姿是否为预设错误估计位姿,响应于优化位姿不为预设错误估计位姿,执行基于优化位姿与样本初始位姿之间的差异,调整位姿估计模型中的网络参数的步骤。
在一些实施例中,响应于优化位姿为预设错误估计位姿,将该优化位姿丢弃,并且不执行基于该优化位姿与样本初始位姿之间的差异,调整位姿估计模型中的网络参数的步骤。
通过在优化位姿不为预设错误估计位姿的情况下,使用该优化位姿与样本初始位姿之间的差异调整位姿估计模型中的网络参数,可以减少错误估计对位姿估计模型的扰乱,提高最终获得的位姿估计模型的准确性。
其中,判断优化位姿是否为预设错误估计位姿的方式可以是:
首先,获取各对象像素点对应的偏差表征值之间的集中趋势表征值。如上述,对象像素点为样本彩色图像中属于待定位对象的像素点,对象像素点对应的偏差表征值为对象像素点对应的目标位姿差与对象像素点对应的法线方向之间的乘积,目标位姿差为对象像素点对应的第一三维点和第二三维点之间的位姿差,其中,第一三维点为待定位对象的渲染深度图对应的第一点云中的三维点,第二三维点为待定位对象的样本深度图像对应的第二点云中的三维点。第一点云由渲染深度图经过反投影得到,第二点云由样本深度图像经过反投影得到。
在一些实施例中,集中趋势表征值为各对象像素点对应的偏差表征值的平均值。
然后,判断集中趋势表征值是否小于或等于预设尺寸。其中,预设尺寸与待定位对象在物理世界下的尺寸相关。例如,预设尺寸可以是待定位对象长度的0.2倍。
最后,响应于集中趋势表征值小于或等于预设尺寸,确定优化位姿不为预设错误估计位姿;响应于集中趋势表征值大于预设尺寸,确定优化位姿为预设错误估计位姿。
通过在集中趋势表征值不大于预设尺寸的情况下,认为优化位姿不是预设错误估计位姿,能够基于待定位对象的物理尺寸对优化位姿进行过滤。
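A sketch of the wrong-estimate filter: the central tendency value is taken as the mean of the per-pixel deviation values, and the optimized pose is kept only when this mean does not exceed a preset size tied to the object's physical dimensions (the text gives 0.2 times the object length as an example). The deviation values are assumed to be expressed in the same length unit as the object length.

```python
import numpy as np

def is_wrong_estimate(residuals, object_length, ratio=0.2):
    """Decide whether an optimized pose should be discarded as a wrong estimate.

    residuals: per-object-pixel deviation values; object_length: physical length of the
    object to be located; ratio: 0.2 is the example factor given in the text.
    """
    central_tendency = float(np.mean(residuals))   # mean of the deviation values
    preset_size = ratio * object_length
    return central_tendency > preset_size          # keep the pose only if <= preset size
```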
为更好地理解本公开实施例提供的位姿估计模型的训练方法,可参考图3,图3是本公开实施例提供的一种位姿估计模型的训练方法的流程示意图。
如图3所示,给定一组包含待定位对象的未注释的样本图像数据,该样本图像数据包含至少一张样本彩色图像和该至少一张样本彩色图像对应的至少一张样本深度图像,首先,利用位姿估计模型预测至少一张样本彩色图像中的待定位对象的初始姿态;然后,利用至少一张样本深度图像中的深度信息对估计的样本初始姿态进行位姿优化,其中,位姿优化的方式可以是迭代优化;再次,对优化获得的位姿进行评估,即判断优化获得的位姿是否是预设错误估计,并基于评估结果对优化位姿进行过滤,丢弃错误的位姿估计;最后,确定保留的优化位姿与样本初始位姿之间的差异,并基于该差异调整位姿估计模型的网络参数。
在位姿估计过程中,本公开实施例提供的位姿估计模型可以以一张样本彩色图像估计待定位对象的6D姿态。
在一些实施例中,本公开实施例提供的位姿估计模型的训练方法可以应用于增强现实应用领域。
请参见图4,图4是本公开实施例提供的一种位姿估计方法的流程示意图。如图4所示,本公开实施例提供的位姿估计方法可以包括步骤S21至步骤S23:
步骤S21,获取包含待定位对象的目标图像,目标图像包括目标彩色图像和目标彩色图像对应的目标深度图像。
其中,包含待定位对象的目标图像可以是位姿估计方法的执行设备拍摄得到的,也可以是由与执行设备建立通信连接的其他设备拍摄得到的。
步骤S22,利用位姿估计模型对目标彩色图像进行处理,得到待定位对象的目标初始位姿。
其中,得到目标初始位姿的方式可参考上述位姿估计模型的训练方法实施例中获取样本初始位姿的方式。其中,位姿估计模型是利用上述位姿估计模型的训练方法实施例提供的训练方法获得的。
步骤S23,基于目标深度图像中待定位对象的深度信息,对目标初始位姿进行优化,得到待定位对象的目标位姿。
其中,得到待定位对象的目标位姿的方式可参考上述位姿估计模型的训练方法实施例中获取优化位姿的方式。
在本公开实施例提供的位姿估计方法中,利用位姿估计模型对目标彩色图像进行处理,得到待定位对象的目标初始位姿之后,再利用目标深度图像对目标初始位姿进行优化,这样,对待定位对象的位姿估计过程既利用了目标彩色图像中颜色、纹理、轮廓等特征,又利用了目标深度图像中的深度特征,使得优化后的待定位对象的目标位姿更为准确。
在一些实施例中,本公开实施例提供的位姿估计方法可以应用于增强现实应用领域。
在一些实施例中,本公开实施例提供的位姿估计方法的执行主体可以是位姿估计装置,该位姿估计装置可以是任意一种能够执行本公开实施例提供的位姿估计方法的终端设备、服务器或者其它处理设备。其中,终端设备可以为增强现实显示设备、视觉定位设备、用户设备(User Equipment,UE)、移动设备、用户终端、终端、蜂窝电话、无绳电话、个人数字处理(Personal Digital Assistant,PDA)、手持设备、计算设备、车载设备、可穿戴设备等。在一些实现方式中,该位姿估计方法可以通过处理器调用存储器中存储的计算机可读指令的方式来实现。
本领域技术人员可以理解,本公开实施例提供的上述方法中,各步骤的撰写顺序并不意味着严格的执行顺序而对实施过程构成任何限定,各步骤的执行顺序应当以其功能和可能的内在逻辑确定。
请参阅图5,图5是本公开实施例提供的一种位姿估计模型的训练装置的组成结构示意图。位姿估计模型的训练装置50包括样本图像获取部分51、样本位姿估计部分52、样本位姿优化部分53以及参数调整部分54。其中,样本图像获取部分51,被配置为获取包含待定位对象的样本图像,样本图像包含样本彩色图像和样本彩色图像对应的样本深度图像;样本位姿估计部分52,被配置为利用位姿估计模型对样本彩色图像进行处理,得到待定位对象的样本初始位姿;样本位姿优化部分53, 被配置为基于样本深度图像中待定位对象的深度信息,对样本初始位姿进行优化,得到待定位对象的优化位姿;参数调整部分54,被配置为基于优化位姿与样本初始位姿之间的差异,调整位姿估计模型中的网络参数。
上述方案中,利用位姿估计模型对样本彩色图像进行处理,得到待定位对象的样本初始位姿之后,再利用样本深度图像对样本初始位姿进行优化,这样,对待定位对象的位姿估计过程既利用了样本彩色图像中颜色、纹理、轮廓等特征,又利用了样本深度图像中的深度特征,使得优化后的待定位对象的优化位姿更为准确。并且,通过使用优化位姿和样本位姿之间的差异,调整位姿估计模型中的网络参数,而无需对样本彩色图像进行标注,减少了标注工作量,提高了对位姿估计模型的训练效率。
一些公开实施例中,样本位姿优化部分53基于样本深度图像中待定位对象的深度信息,对样本初始位姿进行优化,得到待定位对象的优化位姿,包括:基于样本初始位姿以及待定位对象对应的预设三维模型,确定关于待定位对象的渲染深度图;利用渲染深度图和样本深度图像之间的差异,确定优化项;调整样本初始位姿,以使优化项满足预设要求,并将调整后的样本初始位姿作为优化位姿。
上述方案中,通过基于样本初始位姿以及待定位对象对应的预设三维模型,确定关于待定位对象的渲染深度图,然后基于渲染深度图和样本深度图像之间的差异,构建优化项,利用优化项对样本初始位姿进行调整,使得调整后的样本初始位姿更为准确。
一些公开实施例中,预设要求为优化项最小化;和/或,样本位姿优化部分53还被配置为:基于样本初始位姿以及预设三维模型,确定关于待定位对象的法线图;以及,样本位姿优化部分53利用渲染深度图和样本深度图像之间的差异,确定优化项,包括:分别对渲染深度图和样本深度图像进行反投影,得到渲染深度图对应的第一点云和样本深度图像对应的第二点云,第一点云中包括至少一个对象像素点对应的第一三维点、第二点云中包括各对象像素点对应的第二三维点,对象像素点为样本彩色图像中属于待定位对象的像素点;对于每一对象像素点,确定对象像素点对应的偏差表征值,偏差表征为对象像素点对应的目标位姿差与对象像素点在法线图中对应的法线方向之间的乘积,其中,目标位姿差为对象像素点对应的第一三维点和对应的第二三维点之间的位姿差;结合各对象像素点对应的偏差表征值,确定优化项。
上述方案中,通过反投影渲染深度图和样本深度图像得到第一点云和第二点云,然后,基于第一点云中的三维点与第二点云中的三维点之间的差异,集合各点的法线方向,确定各对象像素点对应的偏差表征值,使得确定的偏差表征值更为准确。
一些公开实施例中,基于优化位姿与样本初始位姿之间的差异,调整位姿估计 模型中的网络参数之前,调整部分54还被配置为:判断优化位姿是否为预设错误估计位姿;响应于优化位姿不为预设错误估计位姿,执行基于优化位姿与样本初始位姿之间的差异,调整位姿估计模型中的网络参数的步骤。
上述方案中,通过在优化位姿不为预设错误估计位姿的情况下,使用该优化位姿与样本初始位姿之间的差异,调整位姿估计模型中的网络参数,可以减少错误估计对位姿估计模型的扰乱。
一些公开实施例中,调整部分54判断优化位姿是否为预设错误估计位姿,包括:获取各对象像素点对应的偏差表征值之间的集中趋势表征值,其中,对象像素点为样本彩色图像中属于待定位对象的像素点,对象像素点对应的偏差表征值为对象像素点对应的目标位姿差与对象像素点对应的法线方向之间的乘积,目标位姿差为对象像素点对应的第一三维点和对应的第二三维点之间的位姿差,第一三维点为渲染深度图对应的第一点云中三维点,第二三维点为样本深度图像对应的第二点云中的三维点;判断集中趋势表征值是否小于或等于预设尺寸,预设尺寸与待定位对象在物理世界下的尺寸相关;响应于集中趋势表征值小于或等于预设尺寸,确定优化位姿不为预设错误估计位姿。
在一些实施例中,集中趋势表征值为各对象像素点对应的偏差表征值的平均值。
上述方案中,通过在集中趋势表征值不大于预设尺寸的情况下,认为优化位姿不是预设错误估计位姿,实现基于待定位对象的物理尺寸对优化位姿的过滤。
一些公开实施例中,样本位姿估计部分52利用位姿估计模型对样本彩色图像处理,得到待定位对象的样本初始位姿,包括:利用位姿估计模型,确定关于待定位对象的至少一个三维关键点在样本彩色图像上的投影位置;基于各三维关键点在样本彩色图像上的投影位置以及目标相机的内参,确定待定位对象的样本初始位姿。
上述方案中,通过位姿估计模型,能够确定待定位对象的三维关键点在样本彩色图像上的投影位置,从而根据确定的三维关键点的投影位置以及目标相机的内部参数,得到待定位对象的样本初始位姿。
一些公开实施例中,样本位姿估计部分52利用位姿估计模型,确定待定位对象的至少一个三维关键点在样本彩色图像上的投影位置,包括:利用位姿估计模型,预测各对象像素点分别到每一投影位置的方向向量,对象像素点为样本彩色图像中属于待定位对象的像素点;对于每一投影位置,从与投影位置对应的至少一个方向向量中确定预设数量的方向向量,生成各方向向量对应的候选投影位置;基于各候选投影位置之间的位置关系,确定各候选投影位置的分数;将分数满足预设要求的候选投影位置,作为投影位置。
上述方案中,通过基于各像素点关于投影位置的方向向量确定至少一个候选投影位置,然后从至少一个候选投影位置中选出满足要求的候选投影位置,作为最终 的投影位置,使得确定得到的投影位置更为准确。
一些公开实施例中,样本位姿估计部分52从与投影位置对应的至少一个方向向量中确定预设数量的方向向量,生成各方向向量对应的候选投影位置,包括:将各对象像素点的位置与对象像素点对应的方向向量求和,得到各对象像素点对应的候选投影位置;基于各候选投影位置之间的位置关系,确定各候选投影位置的分数,包括:对于每一候选投影位置,确定候选投影位置与其他候选投影位置之间的目标距离的数量,并将目标距离的数量作为分数,目标距离为小于或等于预设距离的距离;将分数满足预设要求的候选投影位置,作为投影位置,包括:将最大分数对应的候选投影位置作为投影位置。
上述方案中,通过确定各候选投影位置之间的距离,确定最终的投影位置,使得确定的投影位置更为准确。
一些公开实施例中,样本位姿估计部分52利用位姿估计模型对样本彩色图像处理,得到待定位对象的样本初始位姿,包括:利用位姿估计模型对样本彩色图像进行目标检测,得到待定位对象的位置;基于待定位对象的位置,对样本彩色图像进行裁剪,得到包含待定位对象的局部图像;对局部图像进行处理,得到待定位对象的样本初始位姿。
上述方案中,通过先对样本彩色图像进行目标检测,得到待定位对象的位置之后,对样本彩色图像进行裁剪,得到包含待定位对象的局部图像,通过对局部图像进行处理得到待定位对象的样本初始位姿,由此能够减少背景的干扰,从而提高样本初始位姿的识别准确度。
请参阅图6,图6是本公开实施例提供的一种位姿估计装置的组成结构示意图。
位姿估计装置60包括目标图像获取部分61、目标位姿估计部分62以及目标位姿优化部分63。其中,目标图像获取部分61,被配置为获取包含待定位对象的目标图像,所述目标图像包括目标彩色图像和所述目标彩色图像对应的目标深度图像;目标位姿估计部分62,被配置为利用位姿估计模型对所述目标彩色图像进行处理,得到所述待定位对象的目标初始位姿;目标位姿优化部分63,被配置为基于所述目标深度图像中所述待定位对象的深度信息,对所述目标初始位姿进行优化,得到所述待定位对象的目标位姿;其中,所述位姿估计模型是利用上述位姿估计模型的训练装置实施例中提供的位姿估计模型的训练装置训练得到的。
上述方案中,利用位姿估计模型对目标彩色图像进行处理,得到待定位对象的目标初始位姿之后,再利用目标深度图像对目标初始位姿进行优化,这样,对待定位对象的位姿估计过程既利用了目标彩色图像中颜色、纹理、轮廓等特征,又利用了目标深度图像中的深度特征,使得优化后的待定位对象的目标位姿更为准确。
请参阅图7,图7是本公开实施例提供的一种电子设备的结构示意图。
电子设备70包括相互耦接的存储器71和处理器72,处理器72被配置为执行存储器71中存储的计算机程序指令,以实现上述任一位姿估计模型的训练方法实施例的步骤,或实现上述任一位姿估计方法实施例中的步骤。在一些实施场景中,电子设备70可以包括但不限于:微型计算机、服务器,此外,电子设备70还可以包括笔记本电脑、平板电脑等移动设备,在此不做限定。
在一些实施例中,处理器72被配置为控制其自身以及存储器71以实现上述任一位姿估计模型的训练方法实施例中的步骤,或实现上述任一位姿估计方法实施例中的步骤。处理器72还可以称为CPU(Central Processing Unit,中央处理单元)。处理器72可能是一种集成电路芯片,具有信号的处理能力。处理器72还可以是通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。另外,处理器72可以由集成电路芯片共同实现。
上述方案中,利用位姿估计模型对样本彩色图像进行处理,得到待定位对象的样本初始位姿之后,再利用样本深度图像对样本初始位姿进行优化,这样,对待定位对象的位姿估计过程既利用了样本彩色图像中颜色、纹理、轮廓等特征,又利用了样本深度图像中的深度特征,使得优化后的待定位对象的优化位姿更为准确。并且,利用优化位姿和样本位姿之间的差异,调整位姿估计模型中的网络参数,无需对样本彩色图像进行标注,减少了标注工作量,提高了对位姿估计模型的训练效率。
请参阅图8,图8为本公开实施例提供的一种计算机可读存储介质的结构示意图。
计算机可读存储介质80存储有能够被处理器运行的计算机程序指令801,程序指令801用于实现上述任一位姿估计模型的训练方法实施例中的步骤,或实现上述任一位姿估计方法实施例中的步骤。
上述方案中,利用位姿估计模型对样本彩色图像进行处理,得到待定位对象的样本初始位姿之后,再利用样本深度图像对样本初始位姿进行优化,这样,对待定位对象的位姿估计过程既利用了样本彩色图像中颜色、纹理、轮廓等特征,又利用了样本深度图像中的深度特征,使得优化后的待定位对象的优化位姿更为准确。并且,利用优化位姿和样本位姿之间的差异,调整位姿估计模型中的网络参数,无需对样本彩色图像进行标注,减少了标注工作量,提高了对位姿估计模型的训练效率。
在一些实施例中,本公开实施例提供的装置具有的功能或包含的模块可以用于执行上文方法实施例描述的方法,其实现可以参照上文方法实施例的描述。
上文对各个实施例的描述倾向于强调各个实施例之间的不同之处,其相同或相 似之处可以互相参考。
在本公开所提供的几个实施例中,应该理解到,所揭露的方法和装置,可以通过其它的方式实现。例如,以上所描述的装置实施方式仅仅是示意性的,例如,模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性、机械或其它的形式。
另外,在本公开各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本公开实施例的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器(processor)执行本公开各个实施方式方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
若本公开技术方案涉及个人信息,应用本公开技术方案的产品在处理个人信息前,已明确告知个人信息处理规则,并取得个人自主同意。若本公开技术方案涉及敏感个人信息,应用本公开技术方案的产品在处理敏感个人信息前,已取得个人单独同意,并且同时满足“明示同意”的要求。例如,在摄像头等个人信息采集装置处,设置明确显著的标识告知已进入个人信息采集范围,将会对个人信息进行采集,若个人自愿进入采集范围即视为同意对其个人信息进行采集;或者在个人信息处理的装置上,利用明显的标识/信息告知个人信息处理规则的情况下,通过弹窗信息或请个人自行上传其个人信息等方式获得个人授权;其中,个人信息处理规则可包括个人信息处理者、个人信息处理目的、处理方式以及处理的个人信息种类等信息。
工业实用性
本公开实施例提供了一种位姿估计方法及相关模型的训练方法、装置、电子设备、计算机可读介质和计算机程序产品,其中,位姿估计模型的训练方法,包括:获取包含待定位对象的样本图像,样本图像包含样本彩色图像和样本彩色图像对应的样本深度图像;利用位姿估计模型对样本彩色图像处理,得到待定位对象的样本 初始位姿;基于样本深度图像中待定位对象的深度信息,对样本初始位姿进行优化,得到待定位对象的优化位姿;基于优化位姿与样本初始位姿之间的差异,调整位姿估计模型中的网络参数。这样,利用位姿估计模型对样本彩色图像进行处理,得到待定位对象的样本初始位姿之后,再利用样本深度图像对样本初始位姿进行优化,使得对待定位对象的位姿估计过程既利用了样本彩色图像中颜色、纹理、轮廓等特征,又利用了样本深度图像中的深度特征,使得优化后的待定位对象的优化位姿更为准确。并且,利用优化位姿和样本位姿之间的差异,调整位姿估计模型中的网络参数,无需对样本彩色图像进行标注,减少了标注工作量,提高了对位姿估计模型的训练效率。

Claims (25)

  1. 一种位姿估计模型的训练方法,包括:
    获取包含待定位对象的样本图像,所述样本图像包含样本彩色图像和所述样本彩色图像对应的样本深度图像;
    利用位姿估计模型对所述样本彩色图像进行处理,得到所述待定位对象的样本初始位姿;
    基于所述样本深度图像中所述待定位对象的深度信息,对所述样本初始位姿进行优化,得到所述待定位对象的优化位姿;
    基于所述优化位姿与所述样本初始位姿之间的差异,调整所述位姿估计模型中的网络参数。
  2. 根据权利要求1所述的方法,其中,所述基于所述样本深度图像中所述待定位对象的深度信息,对所述样本初始位姿进行优化,得到所述待定位对象的优化位姿,包括:
    基于所述样本初始位姿以及所述待定位对象对应的预设三维模型,确定关于所述待定位对象的渲染深度图;
    利用所述渲染深度图和所述样本深度图像之间的差异,确定优化项;
    调整所述样本初始位姿,以使所述优化项满足预设要求,并将调整后的样本初始位姿作为所述优化位姿。
  3. 根据权利要求2所述的方法,其中,所述预设要求为所述优化项最小化;和/或,
    所述方法还包括:基于所述样本初始位姿以及所述预设三维模型,确定关于所述待定位对象的法线图;以及,所述利用所述渲染深度图和所述样本深度图像之间的差异,确定优化项,包括:
    分别对所述渲染深度图和所述样本深度图像进行反投影,得到所述渲染深度图对应的第一点云和所述样本深度图像对应的第二点云,所述第一点云中包括至少一个对象像素点对应的第一三维点、所述第二点云中包括各所述对象像素点对应的第二三维点,所述对象像素点为所述样本彩色图像中属于所述待定位对象的像素点;
    对于每一所述对象像素点,确定所述对象像素点对应的偏差表征值,所述偏差表征值为所述对象像素点对应的目标位姿差与所述对象像素点在所述法线图中对应的法线方向之间的乘积,其中,所述目标位姿差为所述对象像素点对应的第一三维点和对应的第二三维点之间的位姿差;
    基于各所述对象像素点对应的偏差表征值,确定所述优化项。
  4. 根据权利要求2或3所述的方法,其中,所述基于所述优化位姿与所述样本 初始位姿之间的差异,调整所述位姿估计模型中的网络参数之前,所述方法还包括:
    判断所述优化位姿是否为预设错误估计位姿;
    响应于所述优化位姿不为所述预设错误估计位姿,执行所述基于所述优化位姿与所述样本初始位姿之间的差异,调整所述位姿估计模型中的网络参数的步骤。
  5. 根据权利要求4所述的方法,其中,所述判断所述优化位姿是否为预设错误估计位姿,包括:
    获取各对象像素点对应的偏差表征值之间的集中趋势表征值,其中,所述对象像素点为所述样本彩色图像中属于所述待定位对象的像素点,所述对象像素点对应的偏差表征值为所述对象像素点对应的目标位姿差与所述对象像素点对应的法线方向之间的乘积,所述目标位姿差为所述对象像素点对应的第一三维点和对应的第二三维点之间的位姿差,所述第一三维点为所述渲染深度图对应的第一点云中三维点,所述第二三维点为所述样本深度图像对应的第二点云中的三维点;
    判断所述集中趋势表征值是否小于或等于预设尺寸,所述预设尺寸与所述待定位对象在物理世界下的尺寸相关;
    响应于所述集中趋势表征值小于或等于所述预设尺寸,确定所述优化位姿不为所述预设错误估计位姿。
  6. 根据权利要求5所述的方法,其中,所述集中趋势表征值为各对象像素点对应的偏差表征值的平均值。
  7. 根据权利要求1-6任一项所述的方法,其中,所述利用位姿估计模型对所述样本彩色图像处理,得到所述待定位对象的样本初始位姿,包括:
    利用所述位姿估计模型,确定所述待定位对象的至少一个三维关键点在所述样本彩色图像上的投影位置;
    基于各所述三维关键点在所述样本彩色图像上的投影位置以及目标相机的内参,确定所述待定位对象的样本初始位姿。
  8. 根据权利要求7所述的方法,其中,所述利用所述位姿估计模型,确定所述待定位对象的至少一个三维关键点在所述样本彩色图像上的投影位置,包括:
    利用所述位姿估计模型,预测各对象像素点分别到每一所述投影位置的方向向量,所述对象像素点为所述样本彩色图像中属于所述待定位对象的像素点;
    对于每一所述投影位置,从与所述投影位置对应的至少一个方向向量中确定预设数量的方向向量,生成各所述方向向量对应的候选投影位置;
    基于各所述候选投影位置之间的位置关系,确定各所述候选投影位置的分数;
    将所述分数满足预设要求的候选投影位置,作为所述投影位置。
  9. 根据权利要求8所述的方法,其中,所述从与所述投影位置对应的至少一个方向向量中确定预设数量的方向向量,生成各所述方向向量对应的候选投影位置, 包括:
    将各所述对象像素点的位置与所述对象像素点对应的方向向量求和,得到各所述对象像素点对应的候选投影位置;
    所述基于各所述候选投影位置之间的位置关系,确定各所述候选投影位置的分数,包括:
    对于每一所述候选投影位置,确定所述候选投影位置与其他候选投影位置之间的目标距离的数量,并将所述目标距离的数量作为所述分数,所述目标距离为小于或等于预设距离的距离;
    所述将所述分数满足预设要求的候选投影位置,作为所述投影位置,包括:
    将最大分数对应的候选投影位置作为所述投影位置。
  10. 根据权利要求1-9任一项所述的方法,其中,所述利用位姿估计模型对所述样本彩色图像处理,得到所述待定位对象的样本初始位姿,包括:
    利用所述位姿估计模型对所述样本彩色图像进行目标检测,得到所述待定位对象的位置;
    基于所述待定位对象的位置,对所述样本彩色图像进行裁剪,得到包含所述待定位对象的局部图像;
    对所述局部图像进行处理,得到所述待定位对象的样本初始位姿。
  11. 一种位姿估计方法,包括:
    获取包含待定位对象的目标图像,所述目标图像包括目标彩色图像和所述目标彩色图像对应的目标深度图像;
    利用位姿估计模型对所述目标彩色图像进行处理,得到所述待定位对象的目标初始位姿;
    基于所述目标深度图像中所述待定位对象的深度信息,对所述目标初始位姿进行优化,得到所述待定位对象的目标位姿;
    其中,所述位姿估计模型是利用权利要求1至10任一项所述的位姿估计模型的训练方法训练得到的。
  12. 一种位姿估计模型的训练装置,包括:
    样本图像获取部分,被配置为获取包含待定位对象的样本图像,所述样本图像包含样本彩色图像和所述样本彩色图像对应的样本深度图像;
    样本位姿估计部分,被配置为利用位姿估计模型对所述样本彩色图像处理,得到所述待定位对象的样本初始位姿;
    样本位姿优化部分,被配置为基于所述样本深度图像中所述待定位对象的深度信息,对所述样本初始位姿进行优化,得到所述待定位对象的优化位姿;
    参数调整部分,被配置为基于所述优化位姿与所述样本初始位姿之间的差异, 调整所述位姿估计模型中的网络参数。
  13. 根据权利要求12所述的装置,其中,所述样本位姿优化部分,还被配置为:
    基于所述样本初始位姿以及所述待定位对象对应的预设三维模型,确定关于所述待定位对象的渲染深度图;
    利用所述渲染深度图和所述样本深度图像之间的差异,确定优化项;
    调整所述样本初始位姿,以使所述优化项满足预设要求,并将调整后的样本初始位姿作为所述优化位姿。
  14. 根据权利要求13所述的装置,其中,所述预设要求为所述优化项最小化;和/或,
    所述样本位姿优化部分,还被配置为:
    基于所述样本初始位姿以及所述预设三维模型,确定关于所述待定位对象的法线图;以及,
    所述利用所述渲染深度图和所述样本深度图像之间的差异,确定优化项,包括:
    分别对所述渲染深度图和所述样本深度图像进行反投影,得到所述渲染深度图对应的第一点云和所述样本深度图像对应的第二点云,所述第一点云中包括至少一个对象像素点对应的第一三维点、所述第二点云中包括各所述对象像素点对应的第二三维点,所述对象像素点为所述样本彩色图像中属于所述待定位对象的像素点;
    对于每一所述对象像素点,确定所述对象像素点对应的偏差表征值,所述偏差表征值为所述对象像素点对应的目标位姿差与所述对象像素点在所述法线图中对应的法线方向之间的乘积,其中,所述目标位姿差为所述对象像素点对应的第一三维点和对应的第二三维点之间的位姿差;
    基于各所述对象像素点对应的偏差表征值,确定所述优化项。
  15. 根据权利要求13或14所述的装置,其中,所述基于所述优化位姿与所述样本初始位姿之间的差异,调整所述位姿估计模型中的网络参数之前,所述调整部分还被配置为:
    判断所述优化位姿是否为预设错误估计位姿;
    响应于所述优化位姿不为所述预设错误估计位姿,执行所述基于所述优化位姿与所述样本初始位姿之间的差异,调整所述位姿估计模型中的网络参数的步骤。
  16. 根据权利要求15所述的装置,其中,所述判断所述优化位姿是否为预设错误估计位姿,包括:
    获取各对象像素点对应的偏差表征值之间的集中趋势表征值,其中,所述对象像素点为所述样本彩色图像中属于所述待定位对象的像素点,所述对象像素点对应的偏差表征值为所述对象像素点对应的目标位姿差与所述对象像素点对应的法线方向之间的乘积,所述目标位姿差为所述对象像素点对应的第一三维点和对应的第二 三维点之间的位姿差,所述第一三维点为所述渲染深度图对应的第一点云中三维点,所述第二三维点为所述样本深度图像对应的第二点云中的三维点;
    判断所述集中趋势表征值是否小于或等于预设尺寸,所述预设尺寸与所述待定位对象在物理世界下的尺寸相关;
    响应于所述集中趋势表征值小于或等于所述预设尺寸,确定所述优化位姿不为所述预设错误估计位姿。
  17. 根据权利要求16所述的装置,其中,所述集中趋势表征值为各对象像素点对应的偏差表征值的平均值。
  18. 根据权利要求12-17任一项所述的装置,其中,所述样本位姿估计部分还被配置为:
    利用所述位姿估计模型,确定所述待定位对象的至少一个三维关键点在所述样本彩色图像上的投影位置;
    基于各所述三维关键点在所述样本彩色图像上的投影位置以及目标相机的内参,确定所述待定位对象的样本初始位姿。
  19. 根据权利要求18所述的装置,其中,所述利用所述位姿估计模型,确定所述待定位对象的至少一个三维关键点在所述样本彩色图像上的投影位置,包括:
    利用所述位姿估计模型,预测各对象像素点分别到每一所述投影位置的方向向量,所述对象像素点为所述样本彩色图像中属于所述待定位对象的像素点;
    对于每一所述投影位置,从与所述投影位置对应的至少一个方向向量中确定预设数量的方向向量,生成各所述方向向量对应的候选投影位置;
    基于各所述候选投影位置之间的位置关系,确定各所述候选投影位置的分数;
    将所述分数满足预设要求的候选投影位置,作为所述投影位置。
  20. 根据权利要求19所述的装置,其中,所述从与所述投影位置对应的至少一个方向向量中确定预设数量的方向向量,生成各所述方向向量对应的候选投影位置,包括:
    将各所述对象像素点的位置与所述对象像素点对应的方向向量求和,得到各所述对象像素点对应的候选投影位置;
    所述基于各所述候选投影位置之间的位置关系,确定各所述候选投影位置的分数,包括:
    对于每一所述候选投影位置,确定所述候选投影位置与其他候选投影位置之间的目标距离的数量,并将所述目标距离的数量作为所述分数,所述目标距离为小于或等于预设距离的距离;
    所述将所述分数满足预设要求的候选投影位置,作为所述投影位置,包括:
    将最大分数对应的候选投影位置作为所述投影位置。
  21. 根据权利要求12-20任一项所述的装置,其中,所述样本位姿估计部分还被配置为:
    利用所述位姿估计模型对所述样本彩色图像进行目标检测,得到所述待定位对象的位置;
    基于所述待定位对象的位置,对所述样本彩色图像进行裁剪,得到包含所述待定位对象的局部图像;
    对所述局部图像进行处理,得到所述待定位对象的样本初始位姿。
  22. 一种位姿估计装置,包括:
    目标图像获取部分,被配置为获取包含待定位对象的目标图像,所述目标图像包括目标彩色图像和所述目标彩色图像对应的目标深度图像;
    目标位姿估计部分,被配置为利用位姿估计模型对所述目标彩色图像进行处理,得到所述待定位对象的目标初始位姿;
    目标位姿优化部分,被配置为基于所述目标深度图像中所述待定位对象的深度信息,对所述目标初始位姿进行优化,得到所述待定位对象的目标位姿;
    其中,所述位姿估计模型是利用权利要求12所述的位姿估计模型的训练装置训练得到的。
  23. 一种电子设备,包括相互耦接的存储器和处理器,所述处理器被配置为执行所述存储器中存储的程序指令,以实现权利要求1至10任一项所述的位姿估计模型的训练方法,或实现权利要求11所述的位姿估计方法。
  24. 一种计算机可读存储介质,其上存储有程序指令,其中,所述程序指令被处理器执行时实现权利要求1至10任一项所述的位姿估计模型的训练方法,或实现权利要求11所述的位姿估计方法。
  25. 一种计算机程序产品,所述计算机程序产品包括计算机程序或指令,在所述计算机程序或指令在电子设备上运行的情况下,使得所述电子设备执行权利要求1至10中任意一项所述的位姿估计模型的训练方法,或实现权利要求11所述的位姿估计方法。
PCT/CN2023/105934 2022-07-12 2023-07-05 位姿估计方法及相关模型的训练方法、装置、电子设备、计算机可读介质和计算机程序产品 WO2024012333A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210823003.X 2022-07-12
CN202210823003.XA CN115131437A (zh) 2022-07-12 2022-07-12 位姿估计方法及相关模型的训练方法、装置、设备、介质

Publications (1)

Publication Number Publication Date
WO2024012333A1 true WO2024012333A1 (zh) 2024-01-18

Family

ID=83383324

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/105934 WO2024012333A1 (zh) 2022-07-12 2023-07-05 位姿估计方法及相关模型的训练方法、装置、电子设备、计算机可读介质和计算机程序产品

Country Status (2)

Country Link
CN (1) CN115131437A (zh)
WO (1) WO2024012333A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115131437A (zh) * 2022-07-12 2022-09-30 浙江商汤科技开发有限公司 位姿估计方法及相关模型的训练方法、装置、设备、介质
WO2024077518A1 (zh) * 2022-10-12 2024-04-18 广州酷狗计算机科技有限公司 基于增强现实的界面显示方法、装置、设备、介质和产品
CN116452638B (zh) * 2023-06-14 2023-09-08 煤炭科学研究总院有限公司 位姿估计模型的训练方法、装置、设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200184668A1 (en) * 2018-12-05 2020-06-11 Qualcomm Incorporated Systems and methods for three-dimensional pose determination
CN112241976A (zh) * 2019-07-19 2021-01-19 杭州海康威视数字技术股份有限公司 一种训练模型的方法及装置
CN112396657A (zh) * 2020-11-25 2021-02-23 河北工程大学 一种基于神经网络的深度位姿估计方法、装置及终端设备
CN112509036A (zh) * 2020-12-01 2021-03-16 北京航空航天大学 位姿估计网络训练及定位方法、装置、设备、存储介质
CN115131437A (zh) * 2022-07-12 2022-09-30 浙江商汤科技开发有限公司 位姿估计方法及相关模型的训练方法、装置、设备、介质

Also Published As

Publication number Publication date
CN115131437A (zh) 2022-09-30

Similar Documents

Publication Publication Date Title
WO2024012333A1 (zh) 位姿估计方法及相关模型的训练方法、装置、电子设备、计算机可读介质和计算机程序产品
KR102319177B1 (ko) 이미지 내의 객체 자세를 결정하는 방법 및 장치, 장비, 및 저장 매체
WO2022170844A1 (zh) 一种视频标注方法、装置、设备及计算机可读存储介质
CN110568447B (zh) 视觉定位的方法、装置及计算机可读介质
CN108764048B (zh) 人脸关键点检测方法及装置
CN110675487B (zh) 基于多角度二维人脸的三维人脸建模、识别方法及装置
CN109934065B (zh) 一种用于手势识别的方法和装置
CN109087261B (zh) 基于非受限采集场景的人脸矫正方法
WO2019196476A1 (zh) 基于激光传感器生成地图
US11120535B2 (en) Image processing method, apparatus, terminal, and storage medium
CN112200056B (zh) 人脸活体检测方法、装置、电子设备及存储介质
JP2014032623A (ja) 画像処理装置
JP2017123087A (ja) 連続的な撮影画像に映り込む平面物体の法線ベクトルを算出するプログラム、装置及び方法
CN109934165A (zh) 一种关节点检测方法、装置、存储介质及电子设备
CN113436251B (zh) 一种基于改进的yolo6d算法的位姿估计系统及方法
JP7336653B2 (ja) ディープラーニングを利用した屋内位置測位方法
CN112333468B (zh) 图像处理方法、装置、设备及存储介质
CN110188630A (zh) 一种人脸识别方法和相机
CN111382791B (zh) 深度学习任务处理方法、图像识别任务处理方法和装置
CN110910478B (zh) Gif图生成方法、装置、电子设备及存储介质
WO2020244076A1 (zh) 人脸识别方法、装置、电子设备及存储介质
CN110990604A (zh) 图像底库生成方法、人脸识别方法和智能门禁系统
CN111563895A (zh) 一种图片清晰度确定方法、装置、设备及存储介质
WO2023086398A1 (en) 3d rendering networks based on refractive neural radiance fields
CN111461971B (zh) 图像处理方法、装置、设备及计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23838825

Country of ref document: EP

Kind code of ref document: A1