
WO2021035833A1 - Posture prediction method, model training method and device - Google Patents

Posture prediction method, model training method and device

Info

Publication number
WO2021035833A1
WO2021035833A1 PCT/CN2019/106136 CN2019106136W WO2021035833A1 WO 2021035833 A1 WO2021035833 A1 WO 2021035833A1 CN 2019106136 W CN2019106136 W CN 2019106136W WO 2021035833 A1 WO2021035833 A1 WO 2021035833A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
branch network
translation
target
target object
Prior art date
Application number
PCT/CN2019/106136
Other languages
English (en)
French (fr)
Inventor
季向阳
李志刚
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学
Priority to DE112019007677.9T (DE112019007677T5, de)
Publication of WO2021035833A1 (zh)
Priority to US17/679,142 (US11461925B2, en)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Definitions

  • the present disclosure relates to the field of computer vision, and in particular to a posture prediction method, model training method and device.
  • Object pose estimation plays a vital role in robot operations, autonomous driving, and augmented reality.
  • Object pose estimation refers to accurately estimating the pose information of the target object relative to the camera from the picture.
  • The posture information usually includes a rotation amount and a translation amount: the rotation amount represents the rotation of the camera coordinate system relative to the coordinate system of the target object, and the translation amount represents the translation of the origin of the camera coordinate system relative to the origin of the target object coordinate system.
  • Object pose estimation is easily affected by factors such as occlusion, illumination changes, and the symmetry of the object, so accurately estimating the rotation and translation of the camera relative to the target object is very challenging. In the related art, it is difficult to estimate both the rotation amount and the translation amount with high accuracy at the same time.
  • The present disclosure proposes a posture prediction method, a model training method and a device.
  • According to a first aspect, a posture prediction method includes: performing target recognition on a first image to be predicted to determine the area where the target object is located; determining a target image according to the area where the target object is located; inputting the target image into a posture decoupling prediction model to perform posture prediction, where the posture decoupling prediction model includes a basic network, a rotation amount branch network and a translation amount branch network, the basic network is used to extract features from the target image, the rotation amount branch network is used to predict the rotation amount of the target object according to the features, and the translation amount branch network is used to predict the translation amount of the target object according to the features; and determining the rotation amount and the translation amount of the target object according to the output result of the rotation amount branch network and the output result of the translation amount branch network, respectively.
  • According to a second aspect, a model training method includes: performing target recognition on a third image to be trained to determine the area where the training object is located; adding scale disturbance and position disturbance to the area where the training object is located to obtain a disturbed object area; intercepting (cropping) the disturbed object area from the third image to obtain a fourth image; while keeping the aspect ratio of the training object unchanged, transforming the size of the fourth image to the size required by the input of the posture decoupling prediction model to obtain a training image; and using the training image as the input of the posture decoupling prediction model to train the posture decoupling prediction model.
  • A posture prediction device comprises: a first determination module, configured to perform target recognition on a first image to be predicted to determine the area where the target object is located; a second determination module, configured to determine a target image according to the area where the target object is located; an input module, configured to input the target image into a posture decoupling prediction model for posture prediction, where the posture decoupling prediction model includes a basic network, a rotation amount branch network and a translation amount branch network, the basic network is used to extract features from the target image, the rotation amount branch network is used to predict the rotation amount of the target object according to the features, and the translation amount branch network is used to predict the translation amount of the target object according to the features; and a third determination module, configured to determine the rotation amount and the translation amount of the target object according to the output result of the rotation amount branch network and the output result of the translation amount branch network, respectively.
  • A model training device comprises: a first determination module, configured to perform target recognition on a third image to be trained to determine the area where the training object is located; a disturbance module, configured to add scale disturbance and position disturbance to the area where the training object is located to obtain a disturbed object area; an interception module, configured to intercept the disturbed object area from the third image to obtain a fourth image; a transformation module, configured to transform the size of the fourth image to the size required by the input of the posture decoupling prediction model, while keeping the aspect ratio of the training object unchanged, to obtain a training image; and an input module, configured to use the training image as the input of the posture decoupling prediction model to train the posture decoupling prediction model.
  • a posture prediction device including: a processor; a memory for storing executable instructions of the processor; wherein the processor is configured to execute the method of the first aspect.
  • a non-volatile computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions implement the method of the above-mentioned first aspect when the computer program instructions are executed by a processor.
  • a posture prediction device including: a processor; a memory for storing executable instructions of the processor; wherein the processor is configured to execute the method of the second aspect.
  • a non-volatile computer-readable storage medium having computer program instructions stored thereon, where the computer program instructions implement the method of the second aspect when the computer program instructions are executed by a processor.
  • the rotation amount branch network and the translation amount branch network of the attitude decoupling prediction model can respectively predict the rotation amount and the translation amount of the target object, which realizes the decoupling of rotation and translation in the object attitude. It is beneficial to use different strategies to predict the amount of rotation and translation according to the nature of the rotation and translation of the object's attitude, so as to achieve a high-accuracy estimate of the amount of rotation and translation at the same time, and improve the accuracy of attitude prediction.
  • Fig. 1 shows a schematic structural diagram of a posture prediction network according to an embodiment of the present disclosure.
  • Fig. 2 shows a flowchart of a posture prediction method according to an embodiment of the present disclosure.
  • Fig. 3 shows an example of a posture decoupling prediction model according to an embodiment of the present disclosure.
  • Fig. 4 shows a flowchart of a model training method according to an embodiment of the present disclosure.
  • Fig. 5 shows a block diagram of a posture prediction device according to an embodiment of the present disclosure.
  • Fig. 6 shows a block diagram of a model training device according to an embodiment of the present disclosure.
  • Fig. 7 is a block diagram showing a device for posture prediction and model training according to an exemplary embodiment.
  • Object pose estimation methods can be divided into two categories: the indirect method and the direct method.
  • In the direct method, the prediction model directly predicts the posture information of the object from the image. This method does not need the three-dimensional model of the object and can estimate the posture of the object quickly.
  • The direct method can obtain a relatively accurate translation amount, but its prediction of the rotation between the object and the camera is not accurate enough.
  • In the indirect method, the correspondence between the two-dimensional image and the three-dimensional object model is established first, and the correspondence is then solved by a geometric method (such as the PnP algorithm) to obtain the posture information of the object.
  • The indirect method can obtain a relatively accurate rotation amount, but its prediction of the translation between the object and the camera is not accurate enough.
  • The prediction of the translation amount is mainly based on the position and size of the object in the image, whereas the prediction of the rotation amount mainly depends on the appearance of the object in the image.
  • Accordingly, a posture decoupling prediction model of an object and a posture prediction method based on this model are proposed.
  • In the posture prediction method, the rotation and the translation in the object posture are treated separately: each is estimated by the method suited to it, so as to improve the accuracy of posture prediction.
  • Fig. 1 shows a schematic structural diagram of a posture prediction network according to an embodiment of the present disclosure.
  • the pose prediction network may include an object detector and a pose decoupling prediction model.
  • the posture decoupling prediction model may include a basic network, a rotation amount branch network and a translation amount branch network.
  • the object detector is used to detect the area where the target object is located in the input image.
  • the object detector can be any network capable of recognizing the target object, which is not limited in the present disclosure.
  • the basic network is used to extract features from the input image, and the basic network can be any network capable of extracting features from the image, which is not limited in the present disclosure.
  • The basic network may include a first basic network and a second basic network (not shown).
  • The first basic network may extract a first feature from the input image, and the second basic network may extract a second feature from the input image.
  • The terminal may input the first feature into the rotation amount branch network, so that the rotation amount branch network predicts the rotation amount of the target object according to the first feature.
  • The terminal may input the second feature into the translation amount branch network, so that the translation amount branch network predicts the translation amount of the target object according to the second feature.
  • The network structures of the first basic network and the second basic network may be the same or different, which is not limited in the present disclosure.
  • For example, the first basic network can extract the appearance features of the object in the image as the first feature, and the second basic network can extract the location and size features of the object in the image as the second feature.
  • The rotation amount branch network is used to predict the rotation amount of the target object based on the features extracted by the basic network.
  • The indirect method is used to predict the rotation amount of the object. As mentioned above, the indirect method establishes a correspondence between the two-dimensional image and the three-dimensional model, and then solves the correspondence to obtain the rotation amount R of the target object.
  • For example, for each pixel in the input image that belongs to the target object, the rotation amount branch network can predict the three-dimensional coordinates of the corresponding point on the object model (referred to as the object three-dimensional coordinate prediction method), so as to establish the correspondence between the image and the object model.
  • The structure of the rotation amount branch network matches the method used to establish the correspondence between the image and the object model.
  • The translation amount branch network is used to predict the translation amount of the target object based on the features extracted by the basic network.
  • The direct method is used to predict the translation amount of the object. As mentioned above, the direct method can determine the translation amount T according to the output result of the translation amount branch network.
  • Fig. 2 shows a flowchart of a posture prediction method according to an embodiment of the present disclosure.
  • the method can be executed by a terminal such as a notebook, a computer, or a personal assistant, and the method can be applied to the posture prediction network shown in FIG. 1.
  • the method may include:
  • Step S11: Perform target recognition on the first image to be predicted, and determine the area where the target object is located.
  • Step S12: Determine a target image according to the area where the target object is located.
  • Step S13: Input the target image into a posture decoupling prediction model to perform posture prediction.
  • The posture decoupling prediction model includes a basic network, a rotation amount branch network and a translation amount branch network. The basic network is used to extract features from the target image, the rotation amount branch network is used to predict the rotation amount of the target object based on the features, and the translation amount branch network is used to predict the translation amount of the target object based on the features.
  • Step S14: Determine the rotation amount and the translation amount of the target object according to the output result of the rotation amount branch network and the output result of the translation amount branch network, respectively.
  • the rotation amount branch network and the translation amount branch network of the attitude decoupling prediction model can respectively predict the rotation amount and the translation amount of the target object, which realizes the decoupling of rotation and translation in the object attitude. It is beneficial to use different strategies to predict the amount of rotation and translation according to the nature of the rotation and translation of the object's attitude, so as to achieve a high-accuracy estimate of the amount of rotation and translation at the same time, and improve the accuracy of attitude prediction.
  • The first image is the image in which the object pose is to be predicted, and it contains the target object whose pose is to be predicted.
  • the terminal can input the first image into the object detector shown in FIG. 1 to perform target recognition on the first image to obtain the area where the target object is located in the first image.
  • the area where the target object is located is a rectangle.
  • The terminal may determine the target image according to the area where the target object is located. Because the target image is later input to the posture decoupling prediction model to predict the rotation amount and the translation amount, the size of the target image should be consistent with the input size required by the posture decoupling prediction model.
  • The terminal can intercept (crop) the area where the target object is located from the first image to obtain a second image and then, while keeping the aspect ratio of the target object unchanged, transform the size of the second image to the size required by the input of the posture decoupling prediction model to obtain the target image.
  • The terminal can pad the zoomed image with zeros as needed to obtain the target image.
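The aspect-preserving zoom and zero padding can be illustrated with a short sketch. It only computes the geometry (zoom factor, zoomed size, padding), assuming a square model input and padding split evenly on both sides; the function name `fit_with_padding` and the centering choice are illustrative assumptions, not taken from the disclosure.

```python
def fit_with_padding(box_w, box_h, target_size):
    """Geometry for zooming a crop into a square input without distortion.

    Returns (scale, new_w, new_h, pad_left, pad_top, pad_right, pad_bottom).
    Centering the zoomed crop inside the padded image is an assumption.
    """
    if box_w <= 0 or box_h <= 0:
        raise ValueError("crop must have positive width and height")
    # A single scale for both axes keeps the aspect ratio unchanged.
    scale = target_size / max(box_w, box_h)
    new_w = max(1, round(box_w * scale))
    new_h = max(1, round(box_h * scale))
    # Whatever is left on the shorter axis is filled with zeros.
    pad_w = target_size - new_w
    pad_h = target_size - new_h
    left, top = pad_w // 2, pad_h // 2
    return scale, new_w, new_h, left, top, pad_w - left, pad_h - top
```

For a 128*64 crop and a 256*256 input, the crop is zoomed by a factor of 2 to 256*128 and padded with 64 rows of zeros above and below.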
  • The terminal may input the target image acquired in step S12 into the posture decoupling prediction model. Specifically, the terminal can input the target image into the basic network of the posture decoupling prediction model, and the basic network extracts features from the target image; the terminal can then use the features as the input of the rotation amount branch network and of the translation amount branch network, so that in step S14 the terminal predicts the rotation amount and the translation amount of the target object according to the output results of the rotation amount branch network and the translation amount branch network, respectively.
  • The output result of the rotation amount branch network may include a three-channel object coordinate map and a one-channel object segmentation map.
  • The three channels of the object coordinate map represent the coordinate values, in three dimensions, of the predicted position of the target object in the three-dimensional coordinate system, and the object segmentation map is used to segment the target object from the target image.
  • The coordinate values of the three dimensions are the coordinate values on the x-axis, the y-axis and the z-axis of the three-dimensional coordinate system.
  • The object segmentation map may be a binary image. For example, a position with a pixel value of 0 indicates that the pixel at that position belongs to the target object, and a position with a pixel value of 255 indicates that the pixel at that position does not belong to the target object.
  • Determining the rotation amount of the target object according to the output result of the rotation amount branch network in step S14 may include: determining the rotation amount of the target object according to the object coordinate map, the object segmentation map, and the pixel coordinates in the first image of the area where the target object is located.
  • Fig. 3 shows an example of a posture decoupling prediction model according to an embodiment of the present disclosure.
  • The basic network uses a 34-layer residual convolutional neural network (ResNet34).
  • The rotation amount branch network includes three feature processing modules and a convolution output layer, and each feature processing module includes one deconvolution layer and two convolutional layers.
  • The translation amount branch network is composed of six convolutional layers and three fully connected layers.
  • The deconvolution layer increases the resolution of the feature, and the convolutional layer processes the feature.
  • the terminal inputs a three-channel 256*256 target image into a 34-layer residual convolutional neural network to obtain a 512-channel 8*8 feature. Then the terminal inputs this feature into the rotation branch network.
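The spatial sizes in this example can be checked with a little arithmetic. The sketch below assumes the standard ResNet34 total stride of 32 (with the final pooling and classification layers removed) and assumes that each of the three deconvolution layers doubles the resolution; the doubling factor is not stated in the text and is inferred from the 8*8 feature and the 64*64 maps output by the rotation amount branch network.

```python
def feature_side(input_side, total_stride=32):
    """Side length of the backbone feature map for a square input."""
    return input_side // total_stride


def upsampled_side(side, num_modules=3, factor=2):
    """Side length after the feature processing modules.

    Assumes each module's deconvolution layer multiplies the resolution
    by `factor`; the convolutional layers are assumed to keep the size.
    """
    for _ in range(num_modules):
        side *= factor
    return side


backbone_side = feature_side(256)             # 256 / 32 = 8
branch_side = upsampled_side(backbone_side)   # 8 * 2 * 2 * 2 = 64
```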
  • The rotation amount branch network outputs a three-channel 64*64 object coordinate map and a one-channel 64*64 object segmentation map.
  • The three channels of the object coordinate map respectively represent the predicted coordinate values of the target object on the x-axis, y-axis and z-axis of the three-dimensional coordinate system.
  • Using the object segmentation map, the terminal can find the area of the object coordinate map where the target object is located and map each pixel in that area back to the first image, so as to establish the correspondence between the pixel coordinates of the target object in the first image and the three-dimensional coordinates of the target object.
  • the terminal can use geometric methods such as PnP algorithm to solve the corresponding relationship, so as to obtain the rotation amount R of the target object.
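How the 2D-3D correspondences might be assembled can be sketched as follows. The sketch assumes the segmentation map has already been converted to booleans (True for object pixels) and that the maps cover the detected area exactly, ignoring any zero padding; `build_correspondences` and its parameters are hypothetical names. The resulting pairs would then be passed to a PnP solver, which is not shown here.

```python
def build_correspondences(coord_map, seg_mask, box_x0, box_y0, box_w, box_h):
    """Pair first-image pixel coordinates with predicted 3D model coordinates.

    coord_map: rows of (x, y, z) tuples, one per map pixel.
    seg_mask:  rows of booleans, True where the pixel belongs to the object.
    box_*:     top-left corner and size of the detected area in the first image.
    """
    map_h = len(seg_mask)
    map_w = len(seg_mask[0])
    pairs = []
    for row in range(map_h):
        for col in range(map_w):
            if not seg_mask[row][col]:
                continue  # background pixels carry no usable 3D coordinate
            # Map the centre of this map pixel back into the first image.
            u = box_x0 + (col + 0.5) * box_w / map_w
            v = box_y0 + (row + 0.5) * box_h / map_h
            pairs.append(((u, v), tuple(coord_map[row][col])))
    return pairs
```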
  • the attitude prediction method improves the accuracy of predicting the amount of rotation compared to the method of uniformly using the direct method to determine the amount of rotation and translation.
  • The output result of the translation amount branch network includes a scale-invariant translation amount, which is the translation of the center of the target object in the target image relative to the center of the target image and which includes three dimensions.
  • Determining the translation amount of the target object according to the output result of the translation amount branch network may include: determining the translation amount of the target object according to the scale-invariant translation amount, the pixel coordinates of the center of the area where the target object is located in the first image, the size of that area, and the parameter information of the camera.
  • The terminal inputs the 512-channel 8*8 feature extracted from the target image into the translation amount branch network.
  • The translation amount branch network outputs the scale-invariant translation amount T_s, which is the translation of the center of the target object in the target image relative to the center of the target image and which includes the three dimensions Δx, Δy and t_z. T_s is defined as shown in formula (1):
  • O_x and O_y represent the real pixel coordinates of the center of the target object in the first image.
  • C_x and C_y represent the pixel coordinates, in the first image, of the center of the area where the target object is located (that is, the pixel coordinates of the center of the target object in the first image as detected by the object detector).
  • r represents the zoom factor applied when the area where the target object is located, as detected by the object detector and intercepted from the first image (the second image), is zoomed to the input size of the posture decoupling prediction model (the target image); that is, the zoom factor from the second image to the target image.
  • The zoom factor r can be determined from the sizes of the second image and the target image.
  • w represents the width of the area where the target object detected by the object detector is located.
  • h represents the height of the area where the target object detected by the object detector is located.
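The body of formula (1) is not reproduced in this text. A reconstruction consistent with the symbol definitions above, assuming that the pixel offsets are normalized by the size of the detected area and that the depth T_z (the z component of the final translation amount) is normalized by the zoom factor, would read:

```latex
\Delta_x = \frac{O_x - C_x}{w}, \qquad
\Delta_y = \frac{O_y - C_y}{h}, \qquad
t_z = \frac{T_z}{r}
```

Under this assumed form, Δx and Δy do not change when the detected area is zoomed, which is what makes the predicted quantity scale-invariant.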
  • The terminal can combine the output of the translation amount branch network (Δx, Δy and t_z) with the pixel coordinates (C_x and C_y) of the center of the area where the target object is located in the first image, the size (h and w) of that area, and the parameter information of the camera (the focal length f_x of the camera on the x-axis and the focal length f_y of the camera on the y-axis) to obtain the final translation amount T (T_x, T_y, T_z) of the target object; the combination is shown in formula (2):
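Formula (2) is likewise not reproduced in this text. The sketch below shows one combination that uses exactly the quantities listed above and a pinhole camera model: the depth is recovered as T_z = r * t_z, and the recovered object center is back-projected with the focal lengths. This form is an assumption, not the confirmed formula. The principal point parameters `p_x` and `p_y` are an added assumption with default 0, which corresponds to pixel coordinates measured from the principal point.

```python
def recover_translation(delta_x, delta_y, t_z, c_x, c_y, w, h, r, f_x, f_y,
                        p_x=0.0, p_y=0.0):
    """Turn a scale-invariant translation into a translation (T_x, T_y, T_z).

    Assumed form: T_z = r * t_z,
                  T_x = (delta_x * w + c_x - p_x) * T_z / f_x,
                  T_y = (delta_y * h + c_y - p_y) * T_z / f_y.
    """
    T_z = r * t_z
    # Undo the normalization to get the object center in first-image pixels.
    o_x = delta_x * w + c_x
    o_y = delta_y * h + c_y
    # Pinhole back-projection of the center at depth T_z.
    T_x = (o_x - p_x) * T_z / f_x
    T_y = (o_y - p_y) * T_z / f_y
    return T_x, T_y, T_z
```

For example, with Δx = 0.1, Δy = -0.2, t_z = 0.5, a detected area centered at (100, 50) of size 40*20, r = 2 and f_x = f_y = 500, the recovered object center is (104, 46) and the depth is 1.0.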
  • the posture prediction method improves the accuracy of predicting the translation amount compared with the method of uniformly adopting the indirect method to determine the rotation amount and the translation amount.
  • Fig. 4 shows a flowchart of a model training method according to an embodiment of the present disclosure.
  • the method can be executed by the terminal, and the method can be used to train the posture decoupling prediction model as shown in FIG. 1 and FIG. 3.
  • the method may include:
  • Step S21: Perform target recognition on the third image to be trained, and determine the area where the training object is located.
  • Step S22: Add scale disturbance and position disturbance to the area where the training object is located to obtain the disturbed object area.
  • Step S23: Intercept the disturbed object area from the third image to obtain a fourth image.
  • Step S24: While keeping the aspect ratio of the training object unchanged, transform the size of the fourth image to the size required by the input of the posture decoupling prediction model to obtain the training image.
  • Step S25: Use the training image as the input of the posture decoupling prediction model to train the posture decoupling prediction model.
  • By adding disturbance to the target recognition result, on the one hand the sample size is expanded, and on the other hand the influence of target recognition errors on the output of the posture decoupling prediction model is reduced, which improves the accuracy of the posture decoupling prediction model.
  • the third image may represent a sample image used to train the posture decoupling prediction model, and the third image includes the training object to be predicted during the training process.
  • the terminal may use the object detector shown in FIG. 1 to perform target recognition on the third image to obtain the area where the training object is located.
  • Step S21 can refer to step S11, which will not be repeated here.
  • In step S22, the terminal may add scale disturbance and position disturbance to the area where the training object is located to obtain the disturbed object area.
  • In step S23, the terminal may intercept the disturbed object area from the third image to obtain a fourth image.
  • Let C and S denote the center and the maximum size of the area where the training object is located, where C consists of the pixel coordinates C_x and C_y of the center of the area in the third image, and S is the larger of h and w, the height and width of the area where the training object is located.
  • The terminal can re-sample the center and size of the area where the training object is located according to a random distribution that depends on C and S, such as a truncated normal distribution, to obtain a sampled center and a sampled maximum size, and then intercept the fourth image from the third image according to the sampled center and size.
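A minimal sketch of such a re-sampling step is given below, using rejection sampling to draw from a truncated normal distribution. The standard deviations, the truncation limit and the names `truncated_gauss` and `perturb_box` are illustrative assumptions; the text only states that the distribution depends on C and S.

```python
import random


def truncated_gauss(rng, mean, sigma, low, high):
    """Draw from a normal distribution, rejecting samples outside [low, high]."""
    while True:
        value = rng.gauss(mean, sigma)
        if low <= value <= high:
            return value


def perturb_box(center, size, rng, pos_sigma=0.1, scale_sigma=0.1, limit=0.25):
    """Return a disturbed (center, size) for one training crop.

    Position noise is proportional to the box size S; scale noise is a
    multiplicative factor around 1. All three ratios are assumed values.
    """
    c_x, c_y = center
    max_shift = limit * size
    new_cx = truncated_gauss(rng, c_x, pos_sigma * size, c_x - max_shift, c_x + max_shift)
    new_cy = truncated_gauss(rng, c_y, pos_sigma * size, c_y - max_shift, c_y + max_shift)
    factor = truncated_gauss(rng, 1.0, scale_sigma, 1.0 - limit, 1.0 + limit)
    return (new_cx, new_cy), size * factor
```

Calling `perturb_box` several times on the same detected area yields several disturbed areas, and therefore several fourth images, from a single third image.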
  • In step S24, the terminal may transform the size of the fourth image to the size required by the input of the posture decoupling prediction model, while keeping the aspect ratio of the training object unchanged, to obtain the training image.
  • Step S24 can refer to step S12, which will not be repeated here.
  • In step S25, the terminal may use the training image as the input of the posture decoupling prediction model to train the posture decoupling prediction model.
  • In step S23, multiple fourth images can be intercepted from one third image, and correspondingly, in step S24, the terminal can obtain multiple training images based on the multiple fourth images. That is, multiple training images used to train the posture decoupling prediction model can be obtained from a single third image.
  • In this way, on the one hand the sample set is expanded, and on the other hand part of the target recognition error is eliminated, which reduces the influence of target recognition errors on the posture decoupling prediction model and improves its prediction accuracy.
  • The posture decoupling prediction model may include a basic network, a rotation amount branch network, and a translation amount branch network. The basic network is used to extract features from the training image.
  • The input of the rotation amount branch network is the features, and its output result includes a three-channel object coordinate map and a one-channel object segmentation map.
  • The three channels of the object coordinate map represent the three coordinate values of the predicted three-dimensional points of the training object.
  • The object segmentation map is used to segment the training object from the training image.
  • The input of the translation amount branch network is the features.
  • The output result of the translation amount branch network includes a scale-invariant translation amount, which is the translation of the center of the training object in the training image relative to the center of the training image and which includes three dimensions.
  • the rotation branch network includes three feature processing modules and one convolution output layer, and each feature processing module includes one deconvolution layer and two convolution layers.
  • the translation amount branch network is formed by stacking six convolutional layers and three fully connected layers.
  • For the posture decoupling prediction model, the basic network, the rotation amount branch network and the translation amount branch network described above, refer to step S13; details are not repeated here.
  • The method may further include: determining a first loss function according to the object coordinate map and the object segmentation map; and determining a second loss function according to the scale-invariant translation amount.
  • Training the posture decoupling prediction model may then include: using the first loss function to train the basic network and the rotation amount branch network; with the parameters of the basic network and the rotation amount branch network fixed, using the second loss function to train the translation amount branch network; and using the first loss function and the second loss function to simultaneously train the basic network, the rotation amount branch network, and the translation amount branch network.
  • the terminal can use formula (3) to determine the first loss function Loss1 from the object coordinate map $\hat{M}_{coor}$ and the object segmentation map $\hat{M}_{conf}$ output by the rotation amount branch network.
  • $\eta_1$ and $\eta_2$ are the weights of the respective loss terms
  • $\ell_1$ is the 1-norm function
  • $\circ$ is the Hadamard product
  • $n_c = 3$ is the number of channels of the object coordinate map $M_{coor}$
  • $M_{coor}$ and $M_{conf}$ represent the real object coordinate map and object segmentation map, and $\hat{M}_{coor}$ and $\hat{M}_{conf}$ represent the object coordinate map and object segmentation map predicted by the rotation amount branch network.
  • the terminal can use formula (4) to determine the second loss function Loss2 from the three dimensions $\hat{\Delta}_x$, $\hat{\Delta}_y$ and $\hat{t}_z$ of the scale-invariant translation amount output by the translation amount branch network.
  • $\eta_3$, $\eta_4$ and $\eta_5$ are the weights of the respective loss terms
  • $\ell_2$ is the 2-norm function
  • $\Delta_x$ and $\hat{\Delta}_x$ respectively represent the true value and the predicted value of the scale-invariant translation amount in the x-axis direction
  • $\Delta_y$ and $\hat{\Delta}_y$ respectively represent the true value and the predicted value of the scale-invariant translation amount in the y-axis direction
  • $t_z$ and $\hat{t}_z$ respectively represent the true value and the predicted value of the scale-invariant translation amount in the z-axis direction.
  • the true values M coor , M conf , ⁇ x , ⁇ y and t z in formula (3) and formula (4) can be determined during the process of taking pictures of the training object with the camera.
  • the training images obtained based on the same third image correspond to the same true value.
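As a rough illustration of the two losses, the sketch below assumes that formula (3) is a 1-norm over the coordinate channels masked by the real segmentation map plus a 1-norm on the segmentation map, and that formula (4) is a 2-norm over the weighted per-axis errors. The formula images are not available in this text, so these forms, the function names `loss1` and `loss2`, the default weights, and the convention that a mask value of 1 marks object pixels are all assumptions.

```python
import math


def loss1(m_coor, m_conf, m_coor_pred, m_conf_pred, eta1=1.0, eta2=1.0):
    """Assumed form of Loss1: masked 1-norm on coordinates + 1-norm on segmentation.

    m_coor, m_coor_pred: list of n_c channels, each a flat list of pixel values.
    m_conf, m_conf_pred: flat list of pixel values (assumed 1 = object, 0 = background).
    """
    coor_term = 0.0
    for chan_true, chan_pred in zip(m_coor, m_coor_pred):
        for mask, t, p in zip(m_conf, chan_true, chan_pred):
            # Hadamard product with the real mask keeps object pixels only.
            coor_term += abs(mask * (t - p))
    conf_term = sum(abs(t - p) for t, p in zip(m_conf, m_conf_pred))
    return eta1 * coor_term + eta2 * conf_term


def loss2(true_ts, pred_ts, etas=(1.0, 1.0, 1.0)):
    """Assumed form of Loss2: 2-norm of the weighted (dx, dy, tz) errors."""
    return math.sqrt(sum((eta * (t - p)) ** 2
                         for eta, t, p in zip(etas, true_ts, pred_ts)))


if __name__ == "__main__":
    coor_true = [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
    coor_pred = [[0.5, 5.0], [0.0, 0.0], [0.0, 0.0]]
    print(loss1(coor_true, [1.0, 0.0], coor_pred, [1.0, 0.5]))
    print(loss2((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
```

In the first call the second pixel lies outside the mask, so its large coordinate error does not contribute to the coordinate term.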
  • Training stage 1: use the first loss function to train the basic network and the rotation amount branch network.
  • Training stage 2: with the parameters of the basic network and the rotation amount branch network fixed, use the second loss function to train the translation amount branch network.
  • Training stage 3: use the first loss function and the second loss function to train the basic network, the rotation amount branch network and the translation amount branch network simultaneously.
  • Training stage 1 and training stage 2 can avoid mutual influence between the rotation amount branch network and the translation amount branch network.
  • Training stage 3 can improve the accuracy of the posture decoupling prediction model.
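The three-stage schedule can be summarized as a small lookup of which sub-networks are updated and which losses are used in each stage. This is only a bookkeeping sketch: the function name `training_plan` and the string labels are hypothetical, and a real training loop would freeze or unfreeze the corresponding parameters in whatever framework is used.

```python
def training_plan(stage):
    """Return (trainable sub-networks, losses used) for training stage 1, 2 or 3."""
    plans = {
        # Stage 1: basic network and rotation branch, first loss only.
        1: ({"base", "rotation"}, ("loss1",)),
        # Stage 2: basic network and rotation branch stay fixed; only the
        # translation branch is updated, using the second loss.
        2: ({"translation"}, ("loss2",)),
        # Stage 3: everything is trained jointly with both losses.
        3: ({"base", "rotation", "translation"}, ("loss1", "loss2")),
    }
    return plans[stage]


if __name__ == "__main__":
    for stage in (1, 2, 3):
        trainable, losses = training_plan(stage)
        print(stage, sorted(trainable), losses)
```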
  • Fig. 5 shows a block diagram of a posture prediction device according to an embodiment of the present disclosure.
  • the device 50 may include:
  • the first determining module 51 is configured to perform target recognition on the first image to be predicted, and determine the area where the target object is located;
  • the second determining module 52 is configured to determine a target image according to the area where the target object is located;
  • the input module 53 is used to input the target image into a posture decoupling prediction model for posture prediction.
  • the posture decoupling prediction model includes a basic network, a rotation amount branch network and a translation amount branch network; the basic network is used to extract features from the target image, the rotation amount branch network is used to predict the rotation amount of the target object based on the features, and the translation amount branch network is used to predict the translation amount of the target object based on the features;
  • the third determining module 54 is configured to determine the rotation amount and the translation amount of the target object according to the output result of the rotation amount branch network and the output result of the translation amount branch network, respectively.
  • the rotation amount branch network and the translation amount branch network of the posture decoupling prediction model can respectively predict the rotation amount and the translation amount of the target object, which decouples rotation and translation in the object posture. This makes it possible to use different strategies for predicting the rotation amount and the translation amount according to their respective properties, so that both are estimated with high accuracy at the same time and the accuracy of posture prediction is improved.
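  • the second determining module is further configured to:
  • intercept the area where the target object is located from the first image to obtain a second image;
  • transform the size of the second image to the size required for the input of the posture decoupling prediction model, while keeping the aspect ratio of the target object unchanged, to obtain the target image.
  • the output result of the rotation amount branch network includes a three-channel object coordinate map and a one-channel object segmentation map; the three channels of the object coordinate map respectively represent the coordinate values, in three dimensions, of the predicted position of the target object in a three-dimensional coordinate system, and the object segmentation map is used to segment the target object from the target image;
  • the third determining module is also used for:
  • determining the rotation amount of the target object according to the object coordinate map, the object segmentation map, and the pixel coordinates in the first image of the area where the target object is located.
  • the output result of the translation amount branch network includes a scale-invariant translation amount; the scale-invariant translation amount is the translation of the center of the target object in the target image relative to the center of the target image, and it includes three dimensions;
  • the third determining module is also used for:
  • determining the translation amount of the target object according to the scale-invariant translation amount, the pixel coordinates and size in the first image of the center of the area where the target object is located, and the parameter information of the camera.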
  • Fig. 6 shows a block diagram of a model training device according to an embodiment of the present disclosure.
  • the device 60 may include:
  • the first determining module 61 is configured to perform target recognition on the third image to be trained, and determine the area where the training object is located;
  • the disturbance module 62 is configured to add scale disturbance and position disturbance to the area where the training object is located, to obtain the disturbed object area;
  • the interception module 63 is used to intercept the disturbed object area from the third image to obtain a fourth image
  • the transformation module 64 is configured to transform the size of the fourth image to the size required by the input of the posture decoupling prediction model while keeping the aspect ratio of the training object unchanged to obtain the training image;
  • the input module 65 is configured to use the training image as the input of the posture decoupling prediction model to train the posture decoupling prediction model.
  • by adding disturbance to the result of target recognition, on the one hand the sample size is expanded, and on the other hand the influence of target recognition errors on the output result of the posture decoupling prediction model is reduced, which improves the accuracy of the posture decoupling prediction model.
  • the posture decoupling prediction model includes a basic network, a rotation amount branch network and a translation amount branch network; the basic network is used to extract features from the training image; the input of the rotation amount branch network is the feature, and the output result of the rotation amount branch network includes a three-channel object coordinate map and a one-channel object segmentation map.
  • the three channels of the object coordinate map respectively represent the three coordinate values of the predicted three-dimensional points of the target object.
  • the object segmentation map is used to segment the target object from the training image; the input of the translation amount branch network is the feature; the output result of the translation amount branch network includes a scale-invariant translation amount.
  • the scale-invariant translation amount is the translation of the center of the training object in the training image relative to the center of the training image, and it includes three dimensions.
  • the device further includes:
  • a second determining module configured to determine a first loss function according to the object coordinate map and the object segmentation map
  • a third determining module configured to determine a second loss function according to the scale-invariant translation amount
  • the first training module is configured to train the basic network and the rotation branch network by using the first loss function
  • the second training module is configured to train the translation amount branch network by using a second loss function when the parameters of the basic network and the rotation amount branch network are fixed;
  • the third training module is configured to use the first loss function and the second loss function to simultaneously train the basic network, the rotation amount branch network, and the translation amount branch network.
  • the rotation branch network includes three feature processing modules and one convolution output layer, and each feature processing module includes one deconvolution layer and two convolution layers.
  • the translation amount branch network is formed by stacking six convolutional layers and three fully connected layers.
  • Fig. 7 is a block diagram showing a device 800 for posture prediction and model training according to an exemplary embodiment.
  • the device 800 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, etc.
  • the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
  • the processing component 802 generally controls the overall operations of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the foregoing method.
  • the processing component 802 may include one or more modules to facilitate the interaction between the processing component 802 and other components.
  • the processing component 802 may include a multimedia module to facilitate the interaction between the multimedia component 808 and the processing component 802.
  • the memory 804 is configured to store various types of data to support the operation of the device 800. Examples of such data include instructions for any application or method operating on the device 800, contact data, phone book data, messages, pictures, videos, etc.
  • the memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk.
  • the power supply component 806 provides power to various components of the device 800.
  • the power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 800.
  • the multimedia component 808 includes a screen that provides an output interface between the device 800 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touch, sliding, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure related to the touch or slide operation.
  • the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
  • the audio component 810 is configured to output and/or input audio signals.
  • the audio component 810 includes a microphone (MIC), and when the device 800 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal.
  • the received audio signal may be further stored in the memory 804 or transmitted via the communication component 816.
  • the audio component 810 further includes a speaker for outputting audio signals.
  • the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module.
  • the above-mentioned peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: home button, volume button, start button, and lock button.
  • the sensor component 814 includes one or more sensors for providing the device 800 with various aspects of status assessment.
  • the sensor component 814 can detect the open/close state of the device 800 and the relative positioning of components, for example the display and the keypad of the device 800.
  • the sensor component 814 can also detect a position change of the device 800 or a component of the device 800, the presence or absence of contact between the user and the device 800, the orientation or acceleration/deceleration of the device 800, and a temperature change of the device 800.
  • the sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects when there is no physical contact.
  • the sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • the communication component 816 is configured to facilitate wired or wireless communication between the device 800 and other devices.
  • the device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • the apparatus 800 may be implemented by one or more application specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components to perform the above methods.
  • a non-volatile computer-readable storage medium is also provided, such as a memory 804 including computer program instructions, which can be executed by the processor 820 of the device 800 to complete the foregoing method.
  • the present disclosure may be a system, method and/or computer program product.
  • the computer program product may include a computer-readable storage medium loaded with computer-readable program instructions for enabling a processor to implement various aspects of the present disclosure.
  • the computer-readable storage medium may be a tangible device that can hold and store instructions used by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • A non-exhaustive list of computer-readable storage media includes: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disk (DVD), memory stick, floppy disk, and mechanical encoding devices, such as a printer with instructions stored thereon.
  • the computer-readable storage medium used here is not to be interpreted as a transient signal itself, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses through fiber optic cables), or electrical signals transmitted through wires.
  • the computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to various computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • the network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network, and forwards the computer-readable program instructions for storage in the computer-readable storage medium in each computing/processing device.
  • the computer program instructions used to perform the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages.
  • Computer-readable program instructions can be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), can be customized by using the status information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to realize various aspects of the present disclosure.
  • These computer-readable program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, or another programmable data processing device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing device, produce a device that implements the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams. These computer-readable program instructions can also be stored in a computer-readable storage medium; they cause computers, programmable data processing apparatuses, and/or other devices to work in a specific manner, so that the computer-readable medium storing the instructions constitutes an article of manufacture that includes instructions for implementing various aspects of the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.
  • each block in the flowchart or block diagram may represent a module, program segment, or part of an instruction, and the module, program segment, or part of an instruction contains one or more executable instructions for realizing the specified logical function.
  • The functions marked in the blocks may also occur in a different order than the order marked in the drawings. For example, two consecutive blocks can actually be executed substantially in parallel, or they can sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagram and/or flowchart, and each combination of blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

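A posture prediction method, a model training method and a device. The posture prediction method includes: performing target recognition on a first image to be predicted, and determining the area where the target object is located; determining a target image according to the area where the target object is located; inputting the target image into a posture decoupling prediction model for posture prediction; and determining the rotation amount and the translation amount of the target object according to the output result of the rotation amount branch network and the output result of the translation amount branch network of the posture decoupling prediction model, respectively. Decoupling rotation and translation in the object posture improves the accuracy of posture prediction.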

Description

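Posture prediction method, model training method and device
Technical Field
The present disclosure relates to the field of computer vision, and in particular to a posture prediction method, a model training method and a device.
Background
Object posture estimation plays a vital role in robot operation, autonomous driving, augmented reality and similar applications. Object posture estimation refers to accurately estimating, from an image, the posture information of a target object relative to the camera. The posture information usually includes a rotation amount and a translation amount: the rotation amount can represent the rotation of the camera coordinate system relative to the target object coordinate system, and the translation amount can represent the translation of the origin of the camera coordinate system relative to the origin of the target object coordinate system.
Object posture estimation is easily affected by occlusion, illumination changes, object symmetry and other factors, so accurately estimating the rotation amount and translation amount of the camera relative to the target object is very challenging. In the related art, it is difficult to estimate both the rotation amount and the translation amount with high accuracy at the same time.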
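Summary
In view of this, the present disclosure proposes a posture prediction method, a model training method and a device.
According to a first aspect of the present disclosure, a posture prediction method is provided, the method including: performing target recognition on a first image to be predicted, and determining the area where the target object is located; determining a target image according to the area where the target object is located; inputting the target image into a posture decoupling prediction model for posture prediction, where the posture decoupling prediction model includes a basic network, a rotation amount branch network and a translation amount branch network, the basic network is used to extract features from the target image, the rotation amount branch network is used to predict the rotation amount of the target object according to the features, and the translation amount branch network is used to predict the translation amount of the target object according to the features; and determining the rotation amount and the translation amount of the target object according to the output result of the rotation amount branch network and the output result of the translation amount branch network, respectively.
According to a second aspect of the present disclosure, a model training method is provided, the method including: performing target recognition on a third image to be trained, and determining the area where the training object is located; adding scale disturbance and position disturbance to the area where the training object is located to obtain a disturbed object area; intercepting the disturbed object area from the third image to obtain a fourth image; transforming the size of the fourth image to the size required by the input of the posture decoupling prediction model, while keeping the aspect ratio of the training object unchanged, to obtain a training image; and using the training image as the input of the posture decoupling prediction model to train the posture decoupling prediction model.
According to a third aspect of the present disclosure, a posture prediction device is provided, the device including: a first determining module, configured to perform target recognition on a first image to be predicted and determine the area where the target object is located; a second determining module, configured to determine a target image according to the area where the target object is located; an input module, configured to input the target image into a posture decoupling prediction model for posture prediction, where the posture decoupling prediction model includes a basic network, a rotation amount branch network and a translation amount branch network, the basic network is used to extract features from the target image, the rotation amount branch network is used to predict the rotation amount of the target object according to the features, and the translation amount branch network is used to predict the translation amount of the target object according to the features; and a third determining module, configured to determine the rotation amount and the translation amount of the target object according to the output result of the rotation amount branch network and the output result of the translation amount branch network, respectively.
According to a fourth aspect of the present disclosure, a model training device is provided, the device including: a first determining module, configured to perform target recognition on a third image to be trained and determine the area where the training object is located; a disturbance module, configured to add scale disturbance and position disturbance to the area where the training object is located to obtain a disturbed object area; an interception module, configured to intercept the disturbed object area from the third image to obtain a fourth image; a transformation module, configured to transform the size of the fourth image to the size required by the input of the posture decoupling prediction model, while keeping the aspect ratio of the training object unchanged, to obtain a training image; and an input module, configured to use the training image as the input of the posture decoupling prediction model to train the posture decoupling prediction model.
According to a fifth aspect of the present disclosure, a posture prediction device is provided, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to execute the method of the first aspect.
According to a sixth aspect of the present disclosure, a non-volatile computer-readable storage medium is provided, on which computer program instructions are stored, where the computer program instructions implement the method of the first aspect when executed by a processor.
According to a seventh aspect of the present disclosure, a posture prediction device is provided, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to execute the method of the second aspect.
According to an eighth aspect of the present disclosure, a non-volatile computer-readable storage medium is provided, on which computer program instructions are stored, where the computer program instructions implement the method of the second aspect when executed by a processor.
In the embodiments of the present disclosure, the rotation amount branch network and the translation amount branch network of the posture decoupling prediction model can respectively predict the rotation amount and the translation amount of the target object, which decouples rotation and translation in the object posture. This makes it possible to use different strategies for predicting the rotation amount and the translation amount according to their respective properties, so that both are estimated with high accuracy at the same time and the accuracy of posture prediction is improved.
Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.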
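Brief Description of the Drawings
The drawings, which are included in and constitute a part of the specification, together with the specification illustrate exemplary embodiments, features and aspects of the present disclosure, and are used to explain the principles of the present disclosure.
Fig. 1 shows a schematic structural diagram of a posture prediction network according to an embodiment of the present disclosure.
Fig. 2 shows a flowchart of a posture prediction method according to an embodiment of the present disclosure.
Fig. 3 shows an example of a posture decoupling prediction model according to an embodiment of the present disclosure.
Fig. 4 shows a flowchart of a model training method according to an embodiment of the present disclosure.
Fig. 5 shows a block diagram of a posture prediction device according to an embodiment of the present disclosure.
Fig. 6 shows a block diagram of a model training device according to an embodiment of the present disclosure.
Fig. 7 is a block diagram showing a device for posture prediction and model training according to an exemplary embodiment.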
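Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure are described in detail below with reference to the drawings. The same reference numerals in the drawings indicate elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise noted.
The word "exemplary" as used here means "serving as an example, embodiment, or illustration". Any embodiment described here as "exemplary" need not be construed as superior to or better than other embodiments.
In addition, numerous specific details are given in the following detailed description to better explain the present disclosure. Those skilled in the art should understand that the present disclosure can also be implemented without certain specific details. In some instances, methods, means, elements and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
Object posture estimation methods fall into two categories: indirect methods and direct methods. In a direct method, the prediction model predicts the posture information of the object directly from the image; this requires no knowledge of the three-dimensional model of the object and can estimate the object posture quickly. A direct method yields a relatively accurate translation amount, but its prediction of the rotation amount between the object and the camera is not accurate enough. In an indirect method, a correspondence between the two-dimensional image and the three-dimensional object model is established first, and the correspondence is then solved by a geometric method (such as the PnP algorithm) to obtain the posture information of the object. An indirect method yields a relatively accurate rotation amount, but its prediction of the translation amount between the object and the camera is not accurate enough.
In the embodiments of the present disclosure, considering that rotation and translation in the object posture differ considerably in nature (for example, the prediction of the translation amount is mainly based on the position and size of the object in the image, while the prediction of the rotation amount mainly depends on the appearance of the object in the image), an object posture decoupling prediction model and a posture prediction method based on that model are proposed. The posture prediction method treats rotation and translation in the object posture separately and selects a suitable method for estimating each of them, thereby improving the accuracy of posture prediction.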
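Fig. 1 shows a schematic structural diagram of a posture prediction network according to an embodiment of the present disclosure. As shown in Fig. 1, the posture prediction network may include an object detector and a posture decoupling prediction model. The posture decoupling prediction model may include a basic network, a rotation amount branch network and a translation amount branch network.
The object detector is used to detect the area where the target object is located in the input image. The object detector can be any network capable of recognizing the target object, which is not limited in the present disclosure.
The basic network is used to extract features from the input image. The basic network can be any network capable of extracting features from an image, which is not limited in the present disclosure.
In a possible implementation, the basic network may include a first basic network and a second basic network (not shown). The first basic network can extract a first feature from the input image, and the second basic network can extract a second feature from the input image. The terminal can input the first feature into the rotation amount branch network, so that the rotation amount branch network predicts the rotation amount of the target object according to the first feature. The terminal can input the second feature into the translation amount branch network, so that the translation amount branch network predicts the translation amount of the target object according to the second feature. The network structures of the first basic network and the second basic network may be the same or different, which is not limited in the present disclosure. In one example, the first basic network can extract the appearance features of the object in the image as the first feature, and the second basic network can extract the position features and size features of the object in the image as the second feature. Performing feature extraction separately with the split basic networks in this way helps the subsequent rotation amount branch network and translation amount branch network make predictions from the features.
The rotation amount branch network is used to predict the rotation amount of the target object according to the features extracted by the basic network. In the embodiments of the present disclosure, the rotation amount of the object is predicted with an indirect method. As described above, an indirect method needs to establish the correspondence between the two-dimensional image and the three-dimensional model, and then solve the correspondence to obtain the rotation amount R of the target object.
In a possible implementation, the rotation amount branch network can predict, for each pixel of the input image that belongs to the target object, the three-dimensional coordinate values of the corresponding point on the object model (referred to as the object three-dimensional coordinate prediction method), thereby establishing the correspondence between the image and the object model.
In a possible implementation, several key points can first be defined on the object model, and the rotation amount branch network is then used to predict the positions of these key points in the input image corresponding to the object model (referred to as the object key point detection method), thereby establishing the correspondence between the image and the object model.
The structure of the rotation amount branch network is adapted to the method used to establish the correspondence between the image and the object model.
The translation amount branch network is used to predict the translation amount of the target object according to the features extracted by the basic network. In the embodiments of the present disclosure, the translation amount of the object is predicted with a direct method. As described above, a direct method can determine the translation amount T from the output result of the translation amount branch network.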
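Fig. 2 shows a flowchart of a posture prediction method according to an embodiment of the present disclosure. The method can be executed by a terminal such as a notebook, a computer or a personal assistant, and can be applied to the posture prediction network shown in Fig. 1. As shown in Fig. 2, the method may include:
Step S11: perform target recognition on the first image to be predicted, and determine the area where the target object is located.
Step S12: determine a target image according to the area where the target object is located.
Step S13: input the target image into the posture decoupling prediction model for posture prediction. The posture decoupling prediction model includes a basic network, a rotation amount branch network and a translation amount branch network; the basic network is used to extract features from the target image, the rotation amount branch network is used to predict the rotation amount of the target object according to the features, and the translation amount branch network is used to predict the translation amount of the target object according to the features.
Step S14: determine the rotation amount and the translation amount of the target object according to the output result of the rotation amount branch network and the output result of the translation amount branch network, respectively.
In the embodiments of the present disclosure, the rotation amount branch network and the translation amount branch network of the posture decoupling prediction model can respectively predict the rotation amount and the translation amount of the target object, which decouples rotation and translation in the object posture. This makes it possible to use different strategies for predicting the rotation amount and the translation amount according to their respective properties, so that both are estimated with high accuracy at the same time and the accuracy of posture prediction is improved.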
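In step S11, the first image is the image whose object posture is to be predicted, and it includes the target object whose posture is to be predicted. The terminal can input the first image into the object detector shown in Fig. 1 and perform target recognition on the first image to obtain the area where the target object is located in the first image. In the embodiments of the present disclosure, the area where the target object is located is a rectangle. In one example, the area where the target object is located can be represented by the pixel coordinates $(C_x, C_y)$ in the first image of the center C of the area and by the maximum size S of the area, where $S = \max(h, w)$, h is the height of the area where the target object is located, and w is its width.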
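The region representation from step S11 can be sketched as follows. The sketch assumes that the detector reports the rectangle by its corner coordinates; that input format and the function name `region_center_and_size` are hypothetical.

```python
def region_center_and_size(x_min, y_min, x_max, y_max):
    """Return ((C_x, C_y), S) for an axis-aligned box, with S = max(h, w)."""
    w = x_max - x_min
    h = y_max - y_min
    center = ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)
    return center, max(h, w)


if __name__ == "__main__":
    # A 100 x 50 box: the centre is its midpoint and S is the longer side.
    print(region_center_and_size(10, 20, 110, 70))
```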
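In step S12, the terminal can determine the target image according to the area where the target object is located. Because the target image is fed to the posture decoupling prediction model used later to predict the rotation amount and the translation amount, the size of the target image should match the size required by the input of the posture decoupling prediction model.
In a possible implementation, the terminal can intercept the area where the target object is located from the first image to obtain a second image, and then transform the size of the second image to the size required by the input of the posture decoupling prediction model, while keeping the aspect ratio of the target object unchanged, to obtain the target image.
It should be noted that the terminal can pad the scaled image with zeros as needed to obtain the target image.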
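The geometry of aspect-preserving scaling followed by zero padding might look like the sketch below. The 256-pixel square input size is taken from the Fig. 3 example; splitting the padding evenly between the two sides and the function name `fit_with_padding` are assumptions, since the text only says that zeros are added around the scaled image.

```python
def fit_with_padding(w, h, target=256):
    """Scale a w x h crop to fit a target x target input, keeping its aspect ratio.

    Returns (scale, new_w, new_h, pad_left, pad_top, pad_right, pad_bottom),
    where the pad values are the numbers of zero-valued pixels to add.
    """
    scale = target / max(w, h)
    new_w = min(target, max(1, round(w * scale)))
    new_h = min(target, max(1, round(h * scale)))
    pad_x = target - new_w
    pad_y = target - new_h
    pad_left, pad_top = pad_x // 2, pad_y // 2
    return scale, new_w, new_h, pad_left, pad_top, pad_x - pad_left, pad_y - pad_top


if __name__ == "__main__":
    # A 100 x 50 crop becomes 256 x 128 with 64 rows of zeros above and below.
    print(fit_with_padding(100, 50))
```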
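In step S13, the terminal can input the target image obtained in step S12 into the posture decoupling prediction model. Specifically, the terminal can input the target image into the basic network of the posture decoupling prediction model, which extracts features from the target image; the terminal can then use the features as the input of both the rotation amount branch network and the translation amount branch network, so that in step S14 the terminal can predict the rotation amount and the translation amount of the target object from the output results of the rotation amount branch network and the translation amount branch network, respectively.
In a possible implementation, the output result of the rotation amount branch network may include a three-channel object coordinate map and a one-channel object segmentation map. The three channels of the object coordinate map respectively represent the coordinate values, in three dimensions, of the predicted position of the target object in a three-dimensional coordinate system, and the object segmentation map is used to segment the target object from the target image.
The coordinate values in three dimensions are the x-axis, y-axis and z-axis coordinate values in the three-dimensional coordinate system. The object segmentation map can be a binary map; for example, a pixel value of 0 at a position indicates that the pixel at that position belongs to the target object, and a pixel value of 255 indicates that it does not.
In a possible implementation, determining the rotation amount of the target object according to the output result of the rotation amount branch network in step S14 may include: determining the rotation amount of the target object according to the object coordinate map, the object segmentation map, and the pixel coordinates in the first image of the area where the target object is located.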
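Fig. 3 shows an example of a posture decoupling prediction model according to an embodiment of the present disclosure. As shown in Fig. 3, the basic network is a 34-layer residual convolutional neural network (ResNet34); the rotation amount branch network includes three feature processing modules and one convolution output layer, and each feature processing module includes one deconvolution layer and two convolution layers; the translation amount branch network is formed by stacking six convolutional layers and three fully connected layers. In the feature processing modules of the rotation amount branch network, the deconvolution layer enlarges the resolution of the features and the convolution layers process the features.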
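As shown in Fig. 3, the terminal inputs the three-channel 256*256 target image into the 34-layer residual convolutional neural network to obtain a 512-channel 8*8 feature. The terminal then inputs the feature into the rotation amount branch network, which outputs a three-channel 64*64 object coordinate map $\hat{M}_{coor}$ and a one-channel 64*64 object segmentation map $\hat{M}_{conf}$. The three channels of the object coordinate map $\hat{M}_{coor}$ respectively represent the predicted x-axis, y-axis and z-axis coordinate values of the target object in the three-dimensional coordinate system. Then, as shown in Fig. 1, the terminal can use the object segmentation map $\hat{M}_{conf}$ to find the area where the target object is located in the object coordinate map $\hat{M}_{coor}$ and map each pixel of that area back to the first image, thereby establishing the correspondence between the pixel coordinates of the target object in the first image and the three-dimensional coordinates of the target object. Finally, the terminal can solve the correspondence with a geometric method such as the PnP algorithm to obtain the rotation amount R of the target object.
Because an indirect method yields a relatively accurate rotation amount, the posture prediction method according to the embodiments of the present disclosure predicts the rotation amount more accurately than a method that uses a direct method for both the rotation amount and the translation amount.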
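Building the 2D-3D correspondences from the two output maps could be sketched as follows. The sketch assumes that the cropped region is a square of side S centred at C, that pixel centres of the output grid are mapped linearly back into the first image, and that a segmentation value of 0 marks object pixels as in the binary-map example; the function name `build_correspondences` and its parameters are hypothetical.

```python
def build_correspondences(coord_map, seg_map, center, size,
                          is_object=lambda value: value == 0):
    """Pair first-image pixel coordinates with predicted 3D object coordinates.

    coord_map: H x W grid of (x, y, z) tuples (the three-channel coordinate map).
    seg_map:   H x W grid of segmentation values.
    center, size: (C_x, C_y) and S of the square region cropped from the first image.
    """
    rows, cols = len(seg_map), len(seg_map[0])
    c_x, c_y = center
    pairs = []
    for v in range(rows):
        for u in range(cols):
            if not is_object(seg_map[v][u]):
                continue  # skip background pixels
            # Map the centre of output-grid cell (u, v) back into the first image.
            px = c_x - size / 2.0 + (u + 0.5) * size / cols
            py = c_y - size / 2.0 + (v + 0.5) * size / rows
            pairs.append(((px, py), coord_map[v][u]))
    return pairs


if __name__ == "__main__":
    coords = [[(1, 2, 3), (0, 0, 0)], [(0, 0, 0), (4, 5, 6)]]
    seg = [[0, 255], [255, 0]]
    print(build_correspondences(coords, seg, (100.0, 100.0), 40.0))
```

The resulting pairs could then be passed, together with the camera intrinsics, to a PnP solver such as OpenCV's `cv2.solvePnP`.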
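In a possible implementation, the output result of the translation amount branch network includes a scale-invariant translation amount. The scale-invariant translation amount is the translation of the center of the target object in the target image relative to the center of the target image, and it includes three dimensions.
In a possible implementation, determining the translation amount of the target object according to the output result of the translation amount branch network may include: determining the translation amount of the target object according to the scale-invariant translation amount, the pixel coordinates and size in the first image of the center of the area where the target object is located, and the parameter information of the camera.
As shown in Fig. 3, the terminal inputs the 512-channel 8*8 feature extracted from the target image into the translation amount branch network. The translation amount branch network outputs the scale-invariant translation amount $T_s$, which is the translation of the center of the target object in the target image relative to the center of the target image. $T_s$ includes the three dimensions $\hat{\Delta}_x$, $\hat{\Delta}_y$ and $\hat{t}_z$, and is defined as shown in formula (1):
[Formula (1), image PCTCN2019106136-appb-000008]
Here, $O_x$ and $O_y$ are the real pixel coordinates of the center of the target object in the first image; $C_x$ and $C_y$ are the pixel coordinates in the first image of the center of the area where the target object is located (that is, the pixel coordinates in the first image of the center of the target object as detected by the object detector); and r is the scaling factor applied when the area detected by the object detector is cropped out (the second image) and fed to the posture decoupling prediction model (as the target image), that is, the scaling factor of the second image in producing the target image. The scaling factor r can be determined from the sizes of the second image and the target image. w is the width of the area where the target object is located as detected by the object detector, and h is its height.
As shown in Fig. 1, the terminal can combine the outputs $\hat{\Delta}_x$, $\hat{\Delta}_y$ and $\hat{t}_z$ of the translation amount branch network with the pixel coordinates ($C_x$ and $C_y$) and size (h and w) in the first image of the center of the area where the target object is located, and with the parameter information of the camera (the focal length $f_x$ of the camera on the x-axis and the focal length $f_y$ on the y-axis), to obtain the final translation amount $T = (T_x, T_y, T_z)$ of the target object. The combination is shown in formula (2):
[Formula (2), image PCTCN2019106136-appb-000011]
Because a direct method yields a relatively accurate translation amount, the posture prediction method according to the embodiments of the present disclosure predicts the translation amount more accurately than a method that uses an indirect method for both the rotation amount and the translation amount.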
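One way the combination step could work is sketched below. Formulas (1) and (2) are images that are not available in this text, so the sketch assumes a common pinhole-camera form: $\Delta_x = (O_x - C_x)/w$, $\Delta_y = (O_y - C_y)/h$, $t_z = T_z / r$, with the object centre back-projected using the focal lengths. These forms, the optional `principal` point parameter, and the function name `recover_translation` are assumptions rather than the patent's stated formulas.

```python
def recover_translation(dx, dy, tz, center, w, h, r, fx, fy, principal=(0.0, 0.0)):
    """Recover (T_x, T_y, T_z) from a scale-invariant translation (dx, dy, tz).

    center: (C_x, C_y) of the detected region in the first image.
    w, h:   width and height of the detected region.
    r:      scaling factor from the second image to the target image.
    fx, fy: camera focal lengths; principal is an assumed principal point.
    """
    c_x, c_y = center
    t_z = r * tz            # assumed: tz = T_z / r
    o_x = dx * w + c_x      # assumed: dx = (O_x - C_x) / w
    o_y = dy * h + c_y      # assumed: dy = (O_y - C_y) / h
    # Back-project the object centre through an assumed pinhole camera.
    t_x = (o_x - principal[0]) * t_z / fx
    t_y = (o_y - principal[1]) * t_z / fy
    return t_x, t_y, t_z


if __name__ == "__main__":
    print(recover_translation(0.1, -0.2, 0.5, (300.0, 200.0),
                              100.0, 50.0, 2.0, 500.0, 500.0))
```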
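Fig. 4 shows a flowchart of a model training method according to an embodiment of the present disclosure. The method can be executed by a terminal and can be used to train the posture decoupling prediction model shown in Fig. 1 and Fig. 3. As shown in Fig. 4, the method may include:
Step S21: perform target recognition on the third image to be trained, and determine the area where the training object is located.
Step S22: add scale disturbance and position disturbance to the area where the training object is located to obtain a disturbed object area.
Step S23: intercept the disturbed object area from the third image to obtain a fourth image.
Step S24: transform the size of the fourth image to the size required by the input of the posture decoupling prediction model, while keeping the aspect ratio of the training object unchanged, to obtain a training image.
Step S25: use the training image as the input of the posture decoupling prediction model to train the posture decoupling prediction model.
In the embodiments of the present disclosure, adding disturbance to the result of target recognition on the one hand expands the sample size, and on the other hand reduces the influence of target recognition errors on the output result of the posture decoupling prediction model, which improves the accuracy of the posture decoupling prediction model.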
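In step S21, the third image is a sample image used to train the posture decoupling prediction model, and it includes the training object to be predicted during training. The terminal can use the object detector shown in Fig. 1 to perform target recognition on the third image and obtain the area where the training object is located. Step S21 is analogous to step S11 and is not repeated here.
In step S22, the terminal can add scale disturbance and position disturbance to the area where the training object is located to obtain a disturbed object area. In step S23, the terminal can intercept the disturbed object area from the third image to obtain a fourth image.
In one example, suppose the center and maximum size of the area where the training object is located are C and s, where $C = (C_x, C_y)$ and $s = \max(h, w)$; $C_x$ and $C_y$ are the pixel coordinates in the third image of the center of the area where the training object is located, and h and w are the height and width of that area. The terminal can resample the center and size of the area according to a random distribution that depends on C and s, such as a truncated normal distribution, to obtain the sampled center $\tilde{C}$ and maximum size $\tilde{s}$ of the area where the training object is located, and then intercept the fourth image from the third image according to $\tilde{C}$ and $\tilde{s}$.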
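The resampling of the region centre and size can be sketched with a truncated normal distribution implemented by rejection sampling. The 10% bounds, the choice of standard deviation, and the names `truncated_gauss`, `perturb_region` and `crop_box` are illustrative assumptions; the text only says that the distribution depends on C and s.

```python
import random


def truncated_gauss(mean, sigma, low, high, rng):
    """Sample N(mean, sigma) by rejection until the value falls in [low, high]."""
    while True:
        value = rng.gauss(mean, sigma)
        if low <= value <= high:
            return value


def perturb_region(center, size, rng, shift_frac=0.1, scale_frac=0.1):
    """Return a disturbed ((C_x, C_y), s) sampled around the detected region."""
    c_x, c_y = center
    max_shift = shift_frac * size    # position disturbance bound (assumed)
    max_scale = scale_frac * size    # scale disturbance bound (assumed)
    new_cx = truncated_gauss(c_x, max_shift / 2, c_x - max_shift, c_x + max_shift, rng)
    new_cy = truncated_gauss(c_y, max_shift / 2, c_y - max_shift, c_y + max_shift, rng)
    new_s = truncated_gauss(size, max_scale / 2, size - max_scale, size + max_scale, rng)
    return (new_cx, new_cy), new_s


def crop_box(center, size):
    """Square crop (x_min, y_min, x_max, y_max) with the given centre and side."""
    half = size / 2.0
    return (center[0] - half, center[1] - half, center[0] + half, center[1] + half)


if __name__ == "__main__":
    rng = random.Random(0)
    # Several disturbed crops can be drawn from one detected region.
    for _ in range(3):
        c, s = perturb_region((100.0, 80.0), 40.0, rng)
        print(crop_box(c, s))
```

Calling `perturb_region` repeatedly on the same detection yields several fourth images from a single third image.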
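In step S24, the terminal can transform the size of the fourth image to the size required by the input of the posture decoupling prediction model, while keeping the aspect ratio of the training object unchanged, to obtain a training image. Step S24 is analogous to step S12 and is not repeated here.
In step S25, the terminal can use the training image as the input of the posture decoupling prediction model to train the posture decoupling prediction model.
It can be understood that, through steps S22 and S23, multiple fourth images can be intercepted from one third image, and accordingly in step S24 the terminal can obtain multiple training images from the multiple fourth images. In other words, multiple training images for training the posture decoupling prediction model can be obtained from one third image. This on the one hand expands the samples, and on the other hand eliminates part of the target recognition error, reducing its influence on the posture decoupling prediction model and improving the prediction accuracy of the model.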
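In a possible implementation, the posture decoupling prediction model may include a basic network, a rotation amount branch network and a translation amount branch network. The basic network is used to extract features from the training image. The input of the rotation amount branch network is the feature, and its output result includes a three-channel object coordinate map and a one-channel object segmentation map; the three channels of the object coordinate map respectively represent the three coordinate values of the predicted three-dimensional points of the target object, and the object segmentation map is used to segment the target object from the training image. The input of the translation amount branch network is the feature, and its output result includes a scale-invariant translation amount; the scale-invariant translation amount is the translation of the center of the training object in the training image relative to the center of the training image, and it includes three dimensions.
In a possible implementation, the rotation amount branch network includes three feature processing modules and one convolution output layer, and each feature processing module includes one deconvolution layer and two convolution layers.
In a possible implementation, the translation amount branch network is formed by stacking six convolutional layers and three fully connected layers.
The above posture decoupling prediction model, basic network, rotation amount branch network and translation amount branch network are as described for step S13 and are not repeated here.
In a possible implementation, the method may further include: determining a first loss function according to the object coordinate map and the object segmentation map; determining a second loss function according to the scale-invariant translation amount; training the basic network and the rotation amount branch network with the first loss function; training the translation amount branch network with the second loss function while the parameters of the basic network and the rotation amount branch network are fixed; and training the basic network, the rotation amount branch network and the translation amount branch network simultaneously with the first loss function and the second loss function.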
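The terminal may determine the first loss function Loss1 by formula (3) from the object coordinate map $\tilde{M}_{coor}$ and the object segmentation map $\tilde{M}_{conf}$ output by the rotation branch network.
$$Loss1 = \eta_1 \cdot \ell_1\left(\sum_{i=1}^{n_c} M_{conf} \circ \left(M_{coor}^{i} - \tilde{M}_{coor}^{i}\right)\right) + \eta_2 \cdot \ell_1\left(M_{conf} - \tilde{M}_{conf}\right) \tag{3}$$
Here $\eta_1$ and $\eta_2$ are the weights of the loss terms, $\ell_1$ is the 1-norm function, $\circ$ is the Hadamard product, and $n_c=3$ is the number of channels of the object coordinate map $M_{coor}$. $M_{coor}$ and $M_{conf}$ denote the ground-truth object coordinate map and object segmentation map, and $\tilde{M}_{coor}$ and $\tilde{M}_{conf}$ denote the object coordinate map and object segmentation map predicted by the rotation branch network.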
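A numerical sketch of the first loss may help. It follows one reading of the description: the coordinate error is masked by the ground-truth segmentation map through an element-wise (Hadamard) product, absolute errors are summed over all three channels, and a second 1-norm term compares the segmentation maps. The function name, the array layouts and the default weights are assumptions for illustration.

```python
import numpy as np


def loss1(pred_coor, pred_conf, gt_coor, gt_conf, eta1=1.0, eta2=1.0):
    """First loss from coordinate maps of shape (3, H, W) and segmentation maps of shape (H, W)."""
    # Hadamard product with the ground-truth mask keeps only object pixels.
    masked_error = gt_conf[None, :, :] * (gt_coor - pred_coor)
    coor_term = np.abs(masked_error).sum()          # 1-norm over channels and pixels
    conf_term = np.abs(gt_conf - pred_conf).sum()   # 1-norm of the segmentation error
    return float(eta1 * coor_term + eta2 * conf_term)
```

The mask means that coordinate predictions on background pixels do not contribute to the loss, so the rotation branch is only penalized where the object is actually visible.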
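The terminal may determine the second loss function Loss2 by formula (4) from the three dimensions $\tilde{\Delta}_x$, $\tilde{\Delta}_y$ and $\tilde{t}_z$ of the scale-invariant translation output by the translation branch network.
$$Loss2 = \ell_2\left(\eta_3 \cdot (\Delta_x - \tilde{\Delta}_x) + \eta_4 \cdot (\Delta_y - \tilde{\Delta}_y) + \eta_5 \cdot (t_z - \tilde{t}_z)\right) \tag{4}$$
Here $\eta_3$, $\eta_4$ and $\eta_5$ are the weights of the loss terms and $\ell_2$ is the 2-norm function. $\Delta_x$ and $\tilde{\Delta}_x$ are the ground-truth value and the predicted value of the scale-invariant translation along the x-axis, $\Delta_y$ and $\tilde{\Delta}_y$ are the ground-truth value and the predicted value along the y-axis, and $t_z$ and $\tilde{t}_z$ are the ground-truth value and the predicted value along the z-axis.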
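The second loss can be sketched in the same way. The example below assumes one particular reading: the three weighted errors are treated as the components of a vector and the loss is the 2-norm of that vector. The function name, the tuple layout `(delta_x, delta_y, t_z)` and the default weights are hypothetical.

```python
import numpy as np


def loss2(pred, gt, eta=(1.0, 1.0, 1.0)):
    """Second loss from predicted and ground-truth (delta_x, delta_y, t_z) triples."""
    # Weighted error per component: eta3 for x, eta4 for y, eta5 for z.
    weighted_error = np.asarray(eta, dtype=float) * (
        np.asarray(gt, dtype=float) - np.asarray(pred, dtype=float)
    )
    return float(np.linalg.norm(weighted_error))  # 2-norm of the weighted error vector
```

Separate weights allow the depth component to be scaled differently from the two in-plane components, which typically have a different numeric range.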
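It should be noted that the ground-truth values $M_{coor}$, $M_{conf}$, $\Delta_x$, $\Delta_y$ and $t_z$ in formula (3) and formula (4) can be determined when the training object is photographed with the camera. Training images obtained from the same third image correspond to the same ground-truth values.
After determining the first loss function and the second loss function, the terminal may train the decoupled pose prediction model in three stages. Training stage one: train the base network and the rotation branch network with the first loss function. Training stage two: with the parameters of the base network and of the rotation branch network fixed, train the translation branch network with the second loss function. Training stage three: train the base network, the rotation branch network and the translation branch network simultaneously with the first loss function and the second loss function.
Training stages one and two prevent the rotation branch network and the translation branch network from interfering with each other. Training stage three improves the accuracy of the decoupled pose prediction model.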
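The three-stage schedule can be summarized as a small table of which parts are updated and which losses are active in each stage. The sketch below is framework-independent. The names `STAGES`, `frozen_parts` and `total_loss` are hypothetical, and summing the two losses in stage three is an assumption, since the text only says that both losses are used simultaneously.

```python
ALL_PARTS = {"base", "rotation", "translation"}

# Which sub-networks are updated and which losses are active in each stage.
STAGES = (
    {"name": "stage 1", "trainable": {"base", "rotation"}, "losses": {"loss1"}},
    {"name": "stage 2", "trainable": {"translation"}, "losses": {"loss2"}},
    {"name": "stage 3", "trainable": set(ALL_PARTS), "losses": {"loss1", "loss2"}},
)


def frozen_parts(stage):
    """Sub-networks whose parameters are not updated in this stage."""
    return ALL_PARTS - stage["trainable"]


def total_loss(stage, loss1_value, loss2_value):
    """Combine the active losses of a stage (plain sum, an assumption)."""
    total = 0.0
    if "loss1" in stage["losses"]:
        total += loss1_value
    if "loss2" in stage["losses"]:
        total += loss2_value
    return total
```

In a deep learning framework, `frozen_parts` would map to disabling gradient updates for the listed sub-networks before running the optimizer for that stage.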
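FIG. 5 shows a block diagram of a pose prediction apparatus according to an embodiment of the present disclosure. As shown in FIG. 5, the apparatus 50 may include:
a first determining module 51, configured to perform target recognition on a first image to be predicted and determine the region where the target object is located;
a second determining module 52, configured to determine a target image according to the region where the target object is located;
an input module 53, configured to input the target image into a decoupled pose prediction model for pose prediction, where the decoupled pose prediction model includes a base network, a rotation branch network and a translation branch network, the base network is configured to extract features from the target image, the rotation branch network is configured to predict the rotation of the target object from the features, and the translation branch network is configured to predict the translation of the target object from the features; and
a third determining module 54, configured to determine the rotation and the translation of the target object from the output of the rotation branch network and the output of the translation branch network, respectively.
In the embodiments of the present disclosure, the rotation branch network and the translation branch network of the decoupled pose prediction model predict the rotation and the translation of the target object separately. This decouples rotation from translation in the object pose, so that a different strategy suited to the properties of each can be used to predict it. Both the rotation and the translation can then be estimated with high accuracy, which improves the accuracy of pose prediction.
In a possible implementation, the second determining module is further configured to:
crop the region where the target object is located from the first image to obtain a second image; and
while keeping the aspect ratio of the target object unchanged, resize the second image to the input size required by the decoupled pose prediction model, to obtain the target image.
In a possible implementation, the output of the rotation branch network includes a three-channel object coordinate map and a one-channel object segmentation map, the three channels of the object coordinate map respectively represent the coordinate values, in three dimensions, of the predicted position of the target object in a three-dimensional coordinate system, and the object segmentation map is used to segment the target object from the target image;
the third determining module is further configured to:
determine the rotation of the target object according to the object coordinate map, the object segmentation map, and the pixel coordinates, in the first image, of the region where the target object is located.
In a possible implementation, the output of the translation branch network includes a scale-invariant translation, the scale-invariant translation is the translation of the center of the target object in the target image relative to the center of the target image, and the scale-invariant translation has three dimensions;
the third determining module is further configured to:
determine the translation of the target object according to the scale-invariant translation, the pixel coordinates and the size, in the first image, of the center of the region where the target object is located, and the parameter information of the camera.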
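FIG. 6 shows a block diagram of a model training apparatus according to an embodiment of the present disclosure. As shown in FIG. 6, the apparatus 60 may include:
a first determining module 61, configured to perform target recognition on a third image to be used for training and determine the region where the training object is located;
a perturbation module 62, configured to add scale perturbation and position perturbation to the region where the training object is located, to obtain a perturbed object region;
a cropping module 63, configured to crop the perturbed object region from the third image to obtain a fourth image;
a transformation module 64, configured to resize the fourth image, while keeping the aspect ratio of the training object unchanged, to the input size required by the decoupled pose prediction model, to obtain a training image; and
an input module 65, configured to use the training image as the input of the decoupled pose prediction model, so as to train the decoupled pose prediction model.
In the embodiments of the present disclosure, adding perturbation to the target recognition result both enlarges the sample set and reduces the influence of target recognition errors on the output of the decoupled pose prediction model, which improves the accuracy of the model.
In a possible implementation, the decoupled pose prediction model includes a base network, a rotation branch network and a translation branch network. The base network is configured to extract features from the training image. The input of the rotation branch network is the features, and its output includes a three-channel object coordinate map and a one-channel object segmentation map; the three channels of the object coordinate map respectively represent the three coordinate values of the predicted three-dimensional point of the target object, and the object segmentation map is used to segment the target object from the training image. The input of the translation branch network is the features, and its output includes a scale-invariant translation; the scale-invariant translation is the translation of the center of the training object in the training image relative to the center of the training image, and it has three dimensions.
In a possible implementation, the apparatus further includes:
a second determining module, configured to determine a first loss function according to the object coordinate map and the object segmentation map;
a third determining module, configured to determine a second loss function according to the scale-invariant translation;
a first training module, configured to train the base network and the rotation branch network with the first loss function;
a second training module, configured to train the translation branch network with the second loss function while the parameters of the base network and of the rotation branch network are fixed; and
a third training module, configured to train the base network, the rotation branch network and the translation branch network simultaneously with the first loss function and the second loss function.
In a possible implementation, the rotation branch network includes three feature processing modules and one convolutional output layer, and each feature processing module includes one deconvolution layer and two convolutional layers.
In a possible implementation, the translation branch network is a stack of six convolutional layers and three fully connected layers.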
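FIG. 7 is a block diagram of an apparatus 800 for pose prediction and model training according to an exemplary embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to FIG. 7, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the apparatus 800, such as operations associated with display, telephone calls, data communication, camera operation and recording. The processing component 802 may include one or more processors 820 to execute instructions so as to complete all or some of the steps of the methods described above. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation of the apparatus 800. Examples of such data include instructions for any application or method operated on the apparatus 800, contact data, phone book data, messages, pictures, videos, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.
The power component 806 supplies power to the various components of the apparatus 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the apparatus 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the apparatus 800 is in an operating mode, such as a photographing mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or may have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the apparatus 800 is in an operating mode, such as a call mode, a recording mode or a speech recognition mode. The received audio signals may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to, a home button, a volume button, a start button and a lock button.
The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the apparatus 800. For example, the sensor component 814 can detect the on/off state of the apparatus 800 and the relative positioning of components, for example the display and the keypad of the apparatus 800. The sensor component 814 can also detect a change in position of the apparatus 800 or of one of its components, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The apparatus 800 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for performing the methods described above.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, for example the memory 804 including computer program instructions, which can be executed by the processor 820 of the apparatus 800 to complete the methods described above.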
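The present disclosure may be a system, a method and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (for example, light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device via a network, for example the Internet, a local area network, a wide area network and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
The computer program instructions for carrying out the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA) or a programmable logic array (PLA), is personalized by using state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions so as to implement various aspects of the present disclosure.
Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus and/or other devices to function in a particular manner, such that the computer-readable medium having the instructions stored therein comprises an article of manufacture including instructions that implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus or another device, so that a series of operational steps is performed on the computer, other programmable data processing apparatus or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings show the architecture, functionality and operation of possible implementations of systems, methods and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment or a portion of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application or their technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.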

Claims (22)

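  1. A pose prediction method, characterized in that the method comprises:
    performing target recognition on a first image to be predicted, and determining the region where a target object is located;
    determining a target image according to the region where the target object is located;
    inputting the target image into a decoupled pose prediction model for pose prediction, wherein the decoupled pose prediction model comprises a base network, a rotation branch network and a translation branch network, the base network is configured to extract features from the target image, the rotation branch network is configured to predict the rotation of the target object according to the features, and the translation branch network is configured to predict the translation of the target object according to the features; and
    determining the rotation and the translation of the target object according to the output of the rotation branch network and the output of the translation branch network, respectively.
  2. The method according to claim 1, characterized in that determining the target image according to the region where the target object is located comprises:
    cropping the region where the target object is located from the first image to obtain a second image; and
    while keeping the aspect ratio of the target object unchanged, resizing the second image to the input size required by the decoupled pose prediction model, to obtain the target image.
  3. The method according to claim 1, characterized in that the output of the rotation branch network comprises a three-channel object coordinate map and a one-channel object segmentation map, the three channels of the object coordinate map respectively represent the coordinate values, in three dimensions, of the predicted position of the target object in a three-dimensional coordinate system, and the object segmentation map is used to segment the target object from the target image;
    determining the rotation of the target object according to the output of the rotation branch network comprises:
    determining the rotation of the target object according to the object coordinate map, the object segmentation map, and the pixel coordinates, in the first image, of the region where the target object is located.
  4. The method according to claim 1, characterized in that the output of the translation branch network comprises a scale-invariant translation, the scale-invariant translation is the translation of the center of the target object in the target image relative to the center of the target image, and the scale-invariant translation has three dimensions;
    determining the translation of the target object according to the output of the translation branch network comprises:
    determining the translation of the target object according to the scale-invariant translation, the pixel coordinates and the size, in the first image, of the center of the region where the target object is located, and the parameter information of the camera.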
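  5. A model training method, characterized in that the method comprises:
    performing target recognition on a third image to be used for training, and determining the region where a training object is located;
    adding scale perturbation and position perturbation to the region where the training object is located, to obtain a perturbed object region;
    cropping the perturbed object region from the third image to obtain a fourth image;
    while keeping the aspect ratio of the training object unchanged, resizing the fourth image to the input size required by a decoupled pose prediction model, to obtain a training image; and
    using the training image as the input of the decoupled pose prediction model, so as to train the decoupled pose prediction model.
  6. The method according to claim 5, characterized in that the decoupled pose prediction model comprises a base network, a rotation branch network and a translation branch network, and the base network is configured to extract features from the training image; the input of the rotation branch network is the features, the output of the rotation branch network comprises a three-channel object coordinate map and a one-channel object segmentation map, the three channels of the object coordinate map respectively represent the three coordinate values of the predicted three-dimensional point of the target object, and the object segmentation map is used to segment the target object from the training image; the input of the translation branch network is the features; the output of the translation branch network comprises a scale-invariant translation, the scale-invariant translation is the translation of the center of the training object in the training image relative to the center of the training image, and the scale-invariant translation has three dimensions.
  7. The method according to claim 6, characterized in that the method further comprises:
    determining a first loss function according to the object coordinate map and the object segmentation map;
    determining a second loss function according to the scale-invariant translation;
    training the base network and the rotation branch network with the first loss function;
    training the translation branch network with the second loss function while the parameters of the base network and of the rotation branch network are fixed; and
    training the base network, the rotation branch network and the translation branch network simultaneously with the first loss function and the second loss function.
  8. The method according to claim 6, characterized in that the rotation branch network comprises three feature processing modules and one convolutional output layer, and each feature processing module comprises one deconvolution layer and two convolutional layers.
  9. The method according to claim 6, characterized in that the translation branch network is a stack of six convolutional layers and three fully connected layers.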
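  10. A pose prediction apparatus, characterized in that the apparatus comprises:
    a first determining module, configured to perform target recognition on a first image to be predicted and determine the region where a target object is located;
    a second determining module, configured to determine a target image according to the region where the target object is located;
    an input module, configured to input the target image into a decoupled pose prediction model for pose prediction, wherein the decoupled pose prediction model comprises a base network, a rotation branch network and a translation branch network, the base network is configured to extract features from the target image, the rotation branch network is configured to predict the rotation of the target object according to the features, and the translation branch network is configured to predict the translation of the target object according to the features; and
    a third determining module, configured to determine the rotation and the translation of the target object according to the output of the rotation branch network and the output of the translation branch network, respectively.
  11. The apparatus according to claim 10, characterized in that the second determining module is further configured to:
    crop the region where the target object is located from the first image to obtain a second image; and
    while keeping the aspect ratio of the target object unchanged, resize the second image to the input size required by the decoupled pose prediction model, to obtain the target image.
  12. The apparatus according to claim 10, characterized in that the output of the rotation branch network comprises a three-channel object coordinate map and a one-channel object segmentation map, the three channels of the object coordinate map respectively represent the coordinate values, in three dimensions, of the predicted position of the target object in a three-dimensional coordinate system, and the object segmentation map is used to segment the target object from the target image;
    the third determining module is further configured to:
    determine the rotation of the target object according to the object coordinate map, the object segmentation map, and the pixel coordinates, in the first image, of the region where the target object is located.
  13. The apparatus according to claim 10, characterized in that the output of the translation branch network comprises a scale-invariant translation, the scale-invariant translation is the translation of the center of the target object in the target image relative to the center of the target image, and the scale-invariant translation has three dimensions;
    the third determining module is further configured to:
    determine the translation of the target object according to the scale-invariant translation, the pixel coordinates and the size, in the first image, of the center of the region where the target object is located, and the parameter information of the camera.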
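  14. A model training apparatus, characterized in that the apparatus comprises:
    a first determining module, configured to perform target recognition on a third image to be used for training and determine the region where a training object is located;
    a perturbation module, configured to add scale perturbation and position perturbation to the region where the training object is located, to obtain a perturbed object region;
    a cropping module, configured to crop the perturbed object region from the third image to obtain a fourth image;
    a transformation module, configured to resize the fourth image, while keeping the aspect ratio of the training object unchanged, to the input size required by a decoupled pose prediction model, to obtain a training image; and
    an input module, configured to use the training image as the input of the decoupled pose prediction model, so as to train the decoupled pose prediction model.
  15. The apparatus according to claim 14, characterized in that the decoupled pose prediction model comprises a base network, a rotation branch network and a translation branch network, and the base network is configured to extract features from the training image; the input of the rotation branch network is the features, the output of the rotation branch network comprises a three-channel object coordinate map and a one-channel object segmentation map, the three channels of the object coordinate map respectively represent the three coordinate values of the predicted three-dimensional point of the target object, and the object segmentation map is used to segment the target object from the training image; the input of the translation branch network is the features; the output of the translation branch network comprises a scale-invariant translation, the scale-invariant translation is the translation of the center of the training object in the training image relative to the center of the training image, and the scale-invariant translation has three dimensions.
  16. The apparatus according to claim 15, characterized in that the apparatus further comprises:
    a second determining module, configured to determine a first loss function according to the object coordinate map and the object segmentation map;
    a third determining module, configured to determine a second loss function according to the scale-invariant translation;
    a first training module, configured to train the base network and the rotation branch network with the first loss function;
    a second training module, configured to train the translation branch network with the second loss function while the parameters of the base network and of the rotation branch network are fixed; and
    a third training module, configured to train the base network, the rotation branch network and the translation branch network simultaneously with the first loss function and the second loss function.
  17. The apparatus according to claim 15, characterized in that the rotation branch network comprises three feature processing modules and one convolutional output layer, and each feature processing module comprises one deconvolution layer and two convolutional layers.
  18. The apparatus according to claim 15, characterized in that the translation branch network is a stack of six convolutional layers and three fully connected layers.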
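  19. A pose prediction apparatus, characterized by comprising:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to perform the method according to any one of claims 1 to 4.
  20. A non-volatile computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 4.
  21. A model training apparatus, characterized by comprising:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to perform the method according to any one of claims 5 to 9.
  22. A non-volatile computer-readable storage medium having computer program instructions stored thereon, characterized in that the computer program instructions, when executed by a processor, implement the method according to any one of claims 5 to 9.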
PCT/CN2019/106136 2019-08-30 2019-09-17 Pose prediction method, model training method and apparatus WO2021035833A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
DE112019007677.9T DE112019007677T5 (de) 2019-08-30 2019-09-17 Posenvorhersageverfahren, Modelltrainingsverfahren und Vorrichtung
US17/679,142 US11461925B2 (en) 2019-08-30 2022-02-24 Pose prediction method and apparatus, and model training method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910815666.5A CN110503689B (zh) 2019-08-30 2019-08-30 Pose prediction method, model training method and apparatus
CN201910815666.5 2019-08-30

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/679,142 Continuation US11461925B2 (en) 2019-08-30 2022-02-24 Pose prediction method and apparatus, and model training method and apparatus

Publications (1)

Publication Number Publication Date
WO2021035833A1 true WO2021035833A1 (zh) 2021-03-04

Family

ID=68590606

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/106136 WO2021035833A1 (zh) 2019-08-30 2019-09-17 Pose prediction method, model training method and apparatus

Country Status (4)

Country Link
US (1) US11461925B2 (zh)
CN (1) CN110503689B (zh)
DE (1) DE112019007677T5 (zh)
WO (1) WO2021035833A1 (zh)

Also Published As

Publication number Publication date
DE112019007677T5 (de) 2022-06-09
US11461925B2 (en) 2022-10-04
CN110503689A (zh) 2019-11-26
US20220180553A1 (en) 2022-06-09
CN110503689B (zh) 2022-04-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19943393

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 19943393

Country of ref document: EP

Kind code of ref document: A1