
CN115880776A - Method for determining key point information and method and device for generating offline action library - Google Patents


Info

Publication number: CN115880776A (application CN202211617345.2A; granted as CN115880776B)
Authority: CN (China)
Prior art keywords: action, motion, key point, image frame, point information
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 李丰果, 刘豪杰, 陈睿智, 范斌, 赵晨, 孙昊
Current and Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211617345.2A

Landscapes

  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure provides a method for determining key point information and a method and device for generating an offline action library. It relates to the field of artificial intelligence, in particular to the technical fields of augmented reality, virtual reality, computer vision, and deep learning, and can be applied to scenes such as the metaverse and virtual digital humans. The specific implementation scheme of the method for determining key point information is as follows: predicting the action pose of the target object in subsequent image frames according to first key point information obtained by recognizing the current action pose of the target object in the current image frame, to obtain key point information of the predicted action pose; determining a first action feature corresponding to the current image frame according to the first key point information and the key point information of the predicted action pose; determining a target action feature in the offline action library whose feature distance from the first action feature is smaller than a distance threshold; and determining the key point information of the current action pose according to second key point information of the action pose of the target object in the image frame corresponding to the target action feature.

Description

Method for determining key point information and method and device for generating offline action library
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to the technical fields of augmented reality, virtual reality, computer vision, deep learning, and the like, and can be applied to scenes such as the metaverse and virtual digital humans.
Background
Research on understanding the motion of a target object (e.g., a human body) is of great significance in a variety of fields, including robotics, autonomous driving, human-computer interaction, and the metaverse. For example, key point information of the action pose of the target object may be obtained through motion understanding technology, and a virtual scene may be constructed based on the obtained key point information. Typically, a monocular camera is used to capture the motion of the target object, and the motion of a virtual object is generated in real time based on the key points of the captured motion.
Disclosure of Invention
The present disclosure provides a method for determining key point information, a method and apparatus for generating an offline action library, an electronic device, and a storage medium, aimed at improving the accuracy of the captured key point information of an action pose.
According to an aspect of the present disclosure, there is provided a method for determining key point information, including: predicting the action pose of the target object in subsequent image frames of the current image frame according to first key point information obtained by recognizing the current action pose of the target object in the current image frame, to obtain key point information of the predicted action pose; determining a first action feature corresponding to the current image frame according to the first key point information and the key point information of the predicted action pose; determining a target action feature in the offline action library whose feature distance from the first action feature is smaller than a distance threshold, wherein the offline action library includes a plurality of action features respectively corresponding to a plurality of image frames, and each action feature in the offline action library is determined according to its corresponding image frame and the subsequent image frames; and determining the key point information of the current action pose according to second key point information of the action pose of the target object in the image frame corresponding to the target action feature.
According to another aspect of the present disclosure, there is provided a method for generating an offline action library, including: for each image frame of the first predetermined number of image frames in an image sequence of a target action, acquiring first key point information of the action pose of a target object in the image frame and second key point information of the action pose of the target object in subsequent image frames of the image frame; determining the action feature corresponding to each image frame according to the first key point information and the second key point information; and determining the predetermined number of action features respectively corresponding to the predetermined number of image frames as the action features of the target action in the offline action library.
According to another aspect of the present disclosure, there is provided an apparatus for determining key point information, including: a pose prediction module configured to predict the action pose of the target object in subsequent image frames of the current image frame according to first key point information obtained by recognizing the current action pose of the target object in the current image frame, to obtain key point information of the predicted action pose; a feature determining module configured to determine a first action feature corresponding to the current image frame according to the first key point information and the key point information of the predicted action pose; a target action determining module configured to determine a target action feature in the offline action library whose feature distance from the first action feature is smaller than a distance threshold, wherein the offline action library includes a plurality of action features respectively corresponding to a plurality of image frames, and each action feature in the offline action library is determined according to its corresponding image frame and the subsequent image frames; and a first information determining module configured to determine the key point information of the current action pose according to second key point information of the action pose of the target object in the image frame corresponding to the target action feature.
According to another aspect of the present disclosure, there is provided an apparatus for generating an offline action library, including: a key point information acquisition module configured to acquire, for each image frame of the first predetermined number of image frames in an image sequence of a target action, first key point information of the action pose of a target object in the image frame and second key point information of the action pose of the target object in subsequent image frames of the image frame; a first feature determining module configured to determine the action feature corresponding to each image frame according to the first key point information and the second key point information; and a second feature determining module configured to determine the predetermined number of action features respectively corresponding to the predetermined number of image frames as the action features of the target action in the offline action library.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for determining keypoint information and/or the method for generating an offline action library provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for determining keypoint information and/or the method for generating an offline action library provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product including computer programs/instructions stored on at least one of a readable storage medium and an electronic device which, when executed by a processor, implement the method for determining key point information and/or the method for generating an offline action library provided by the present disclosure.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic application scenario diagram of a method for determining key point information and a method and an apparatus for generating an offline action library according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a method for determining key point information according to an embodiment of the disclosure;
FIG. 3 is a schematic diagram of the principle of determining feature distances between action features according to an embodiment of the present disclosure;
FIG. 4 is an overall flowchart of implementing the method for determining key point information according to an embodiment of the disclosure;
FIG. 5 is a flow chart diagram of a method of generating an offline action library according to an embodiment of the present disclosure;
fig. 6 is a block diagram of the structure of a determination apparatus of keypoint information according to an embodiment of the present disclosure;
fig. 7 is a block diagram of a structure of an offline action library generation apparatus according to an embodiment of the present disclosure; and
fig. 8 is a block diagram of an electronic device for implementing a method for determining keypoint information or a method for generating an offline action library according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Monocular motion capture refers to acquiring a single image with a monocular camera, perceiving the 3D key points of the action pose of the target object in the single image, and then generating the motion of a virtual object in real time according to the 3D key points. However, the positional accuracy of 3D key points perceived in real time is not high: for example, the positions of 3D key points are often lost under occlusion, and when a complex motion (for example, a two-handed punch) is captured, the generated virtual object model may exhibit mesh penetration, or some 3D key points cannot be touched during human-computer interaction.
In order to improve the accuracy of the determined key point information, the present disclosure provides methods, apparatuses, devices, and media for determining key point information and for generating an offline action library. An application scenario of the methods and apparatuses provided by the present disclosure is described below with reference to fig. 1.
Fig. 1 is a schematic application scenario diagram of a method for determining key point information and a method and an apparatus for generating an offline action library according to an embodiment of the present disclosure.
As shown in fig. 1, the application scenario 100 of this embodiment may include a terminal device 110, which may be any of various electronic devices with processing functions, including but not limited to a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart camera, and so on.
The terminal device 110 may be integrated with a monocular motion capture system, for example, which captures images of a target object online in real time and analyzes them in real time to obtain key point information 120 of the action poses of the target object. The terminal device 110 may, for example, generate a virtual object model with the resulting action pose from the key point information obtained by analyzing the image. The monocular motion capture system on the terminal device 110 may, for example, be communicatively coupled to a monocular camera to acquire images captured by the monocular camera in real time.
In an embodiment, after the key point information 120 is obtained through analysis, the terminal device 110 may further adjust it according to key point information of standard actions generated in advance, for example to compensate for the limited positional accuracy of the 3D key points perceived by the monocular motion capture system.
In one embodiment, the application scenario 100 may further include an electronic device 130, which may be, for example, a laptop computer, a desktop computer, a server, and the like. In one embodiment, the electronic device 130 may be a background management server supporting the monocular motion capture system in the terminal device 110. The terminal device 110 may be communicatively coupled to the electronic device 130 via a network, which may include wired or wireless communication links.
For example, the terminal device 110 may transmit the key point information 120 obtained by analyzing the image to the electronic device 130, and the electronic device 130 may adjust the key point information 120 according to key point information of standard actions generated in advance, to obtain updated key point information 140. After obtaining the updated key point information 140, the electronic device 130 may send it to the terminal device 110, so that the terminal device 110 can generate the virtual object model.
In an embodiment, the electronic device 130 may generate an action feature according to the key point information 120, compare the action feature with an action feature in the offline action library 150, and use the key point information corresponding to the action feature matched with the action feature generated according to the key point information 120 in the offline action library 150 as the updated key point information 140.
The offline action library 150 may be generated in advance and stored in the database 160, for example. The key point information corresponding to the motion features in the offline motion library 150 may be captured by, for example, a multi-view camera motion capture system or an optical motion capture technology. The electronic device 130 may, for example, access the database 160 to obtain the action features in the offline action library 150, or may pre-cache the offline action library 150 in the database 160 into a cache space of the electronic device 130.
It should be noted that the method for determining the key point information provided by the present disclosure may be executed by the terminal device 110, or may be executed by the electronic device 130. Accordingly, the determination apparatus of the key point information provided by the present disclosure may be disposed in the terminal device 110, and may also be disposed in the electronic device 130. The method for generating the offline action library provided by the present disclosure may be executed by the electronic device 130, and may also be executed by any electronic device that can access the database 160. Accordingly, the offline action library generation apparatus provided by the present disclosure may be provided in the electronic device 130, and may also be provided in any electronic device that can access the database 160.
It should be understood that the number and types of terminal devices 110, electronic devices 130, and databases 160 in fig. 1 are merely illustrative. There may be any number and type of terminal devices 110, electronic devices 130, and databases 160, as desired for implementation.
The determination method of the key point information provided by the present disclosure will be described in detail below with reference to fig. 2 to 4.
Fig. 2 is a flowchart illustrating a method for determining keypoint information according to an embodiment of the disclosure.
As shown in fig. 2, the method 200 for determining keypoint information of this embodiment may include operations S210 to S240.
In operation S210, an action pose of the target object in a subsequent image frame of the current image frame is predicted according to first keypoint information obtained by recognizing a current action pose of the target object in the current image frame, so as to obtain keypoint information of the predicted action pose.
According to an embodiment of the present disclosure, the first keypoint information may be, for example, keypoint information of a current motion posture of the target object perceived by the monocular motion capture system after capturing the motion image online. The key point information may include, for example, position information of a plurality of key points of each part or joint of the target object. The location information may be, for example, location information of the 3D key point.
According to an embodiment of the present disclosure, the action poses of the target object in the next n image frames after the current image frame may be predicted, for example, obtaining key point information of n predicted action poses, where n is a natural number. The first key point information and the key point information of the n predicted action poses may constitute a key point information sequence of length (n + 1).
According to an embodiment of the present disclosure, a classifier may first be adopted to perform a preliminary classification of the action of the current action pose according to the first key point information perceived by the monocular motion capture system, obtaining the probabilities that the current action pose belongs to various actions, and the top M actions with the highest probabilities may be selected as the guiding actions for motion generation. Each guiding action may then be input into a motion generation network, which generates the key point information of the follow-up action poses for each of the M actions. The key point information of the follow-up action poses of the M actions is concatenated with the first key point information to obtain key point information sequences representing M complete action pose sequences. These completed key point information sequences can then be fed into a classification network to predict the probability of each sequence corresponding to various actions, and the key point information other than the first key point information in the sequence corresponding to the action with the highest output probability is used as the key point information of the predicted action poses. Here, M may be a natural number greater than 1, such as 2.
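By way of illustration only, the following Python sketch outlines this top-M guided generation; the callables classify, generate, and classify_seq are hypothetical stand-ins for the classification and motion generation networks described above, not interfaces from the disclosure.

```python
import numpy as np

def predict_pose_keypoints(first_kp, classify, generate, classify_seq, m=2):
    """Top-M guided prediction sketch. first_kp: (K, 3) keypoints of the
    current action pose; classify scores actions from a single pose,
    generate produces (n, K, 3) follow-up keypoints for one guiding
    action, classify_seq scores a full (n+1, K, 3) sequence."""
    probs = classify(first_kp)                 # preliminary classification
    guiding = np.argsort(probs)[::-1][:m]      # top-M guiding actions
    best_seq, best_prob = None, -1.0
    for action in guiding:
        follow_up = generate(action, first_kp)             # follow-up poses
        seq = np.concatenate([first_kp[None], follow_up])  # complete sequence
        prob = float(np.max(classify_seq(seq)))            # best action prob
        if prob > best_prob:
            best_seq, best_prob = seq, prob
    return best_seq[1:]   # keypoint information of the predicted poses only
```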
According to an embodiment of the present disclosure, a Mixed Spatio-Temporal Encoder (MixSTE) may also be used to encode the key point information sequence formed by the first key point information and the key point information of the action poses in previous image frames of the current image frame; the features obtained by encoding are then processed by a regression network to obtain the key point information of the predicted action poses. MixSTE is composed of a Spatial Transformer Block (STB) and a Temporal Transformer Block (TTB): the STB computes self-attention between the parts or joints represented by the key points and learns the relationships of the parts or joints of the object in each image frame, while the TTB computes self-attention between image frames and focuses on learning the global temporal correlation of each part or joint.
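A loose PyTorch sketch of one such spatial-plus-temporal stage is given below; the dimensions, head count, and residual wiring are assumptions for illustration, not the MixSTE reference implementation.

```python
import torch
from torch import nn

class MixSTEBlock(nn.Module):
    """One spatial (STB) + temporal (TTB) self-attention stage,
    in the style described above."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (frames, joints, dim). STB: self-attention across the joints
        # within each frame (frames act as the batch dimension).
        s, _ = self.spatial(x, x, x)
        x = x + s
        # TTB: self-attention across frames for each joint.
        t = x.transpose(0, 1)                   # (joints, frames, dim)
        u, _ = self.temporal(t, t, t)
        return (t + u).transpose(0, 1)          # back to (frames, joints, dim)
```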
According to an embodiment of the present disclosure, a skeletal graph scattering network (SGSN) built from adaptive graph scattering blocks may also be adopted to process the key point information sequence formed by the first key point information and the key point information of the action poses in previous image frames of the current image frame, so as to obtain the key point information of the predicted action poses. Given the first key point information, the SGSN first performs a Discrete Cosine Transform (DCT) on the sequence of key points of each part or joint along the time axis, in order to compress the temporal variation into a more compact representation, retain smoothness in temporal order to facilitate stable feature learning, and eliminate explicit temporal modeling so as to reduce the complexity and training difficulty of the model. The elements of the adjacency matrix of the target object structure graph employed by each adaptive graph scattering block are trainable, and cross-layer connections between the beginning and end of the series of adaptive graph scattering blocks stabilize training. Finally, the key point information sequence is recovered through the inverse discrete cosine transform to obtain the key point information of the predicted action poses.
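A minimal sketch of the DCT preprocessing and recovery steps described above follows, assuming (T, K, 3) keypoint trajectories; the adaptive graph scattering stages in between are omitted.

```python
import numpy as np
from scipy.fft import dct, idct

def compress_trajectories(seq: np.ndarray, keep: int) -> np.ndarray:
    """DCT along the time axis of a (T, K, 3) keypoint sequence, keeping
    only the `keep` lowest-frequency coefficients per joint coordinate."""
    coeffs = dct(seq, axis=0, norm="ortho")
    coeffs[keep:] = 0.0          # discard high-frequency temporal detail
    return coeffs

def recover_trajectories(coeffs: np.ndarray) -> np.ndarray:
    """Inverse DCT recovers a smooth key point information sequence."""
    return idct(coeffs, axis=0, norm="ortho")
```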
It is to be understood that the above method of obtaining the keypoint information of the predicted action gesture is only used as an example to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.
In operation S220, a first motion characteristic corresponding to the current image frame is determined according to the first key point information and the key point information of the predicted motion pose.
According to embodiments of the present disclosure, a forward difference or backward difference algorithm, for example, may be used to calculate the movement velocity and acceleration of a given key point across the current image frame and the subsequent image frames, according to the position of that key point in the first key point information and its position in the key point information of the predicted action poses.
In this way, the velocity and acceleration of each key point of the action pose in the current image frame and in each subsequent image frame can be obtained. The position, velocity, and acceleration of a key point in each image frame may constitute one feature vector; if there are n subsequent image frames, (n + 1) feature vectors are obtained for each key point. This embodiment may assemble the (n + 1) feature vectors into a feature vector sequence and concatenate the feature vector sequences of all key points to obtain the first action feature corresponding to the current image frame.
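A minimal numpy sketch of this feature construction is given below, assuming an (n+1, K, 3) array of positions (the recognized pose followed by the n predicted poses) and unit frame spacing; padding the last frames by repetition is one of several reasonable choices.

```python
import numpy as np

def motion_feature(positions: np.ndarray, dt: float = 1.0) -> np.ndarray:
    """Assemble an action feature from an (n+1, K, 3) array of 3D keypoint
    positions. Velocity uses forward differences; acceleration is a second
    forward difference. The last rows are padded by repetition so every
    frame keeps a (position, velocity, acceleration) triple."""
    vel = np.diff(positions, axis=0) / dt            # (n, K, 3)
    vel = np.concatenate([vel, vel[-1:]], axis=0)    # pad to (n+1, K, 3)
    acc = np.diff(vel, axis=0) / dt
    acc = np.concatenate([acc, acc[-1:]], axis=0)
    # One feature vector per frame and keypoint, flattened into the
    # first action feature corresponding to the current image frame.
    per_frame = np.concatenate([positions, vel, acc], axis=-1)  # (n+1, K, 9)
    return per_frame.reshape(-1)

# e.g. feat = motion_feature(np.random.rand(5, 17, 3))  # 17 joints assumed
```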
In an embodiment, the action to which the action pose belongs may also be predicted according to the first key point information; the action amplitude, action completion progress, and the like may be determined according to the positional relationships between the key points in the first key point information; and the predicted action, action amplitude, action completion progress, and the like may be taken as the semantic information of the current image frame. Similarly, semantic information of the subsequent image frames can be obtained from the key point information of the predicted action poses. The embodiment may use the concatenation of the semantic information of the current image frame and that of the subsequent image frames as the first action feature corresponding to the current image frame.
In operation S230, a target motion feature in the offline motion library, which has a feature distance from the first motion feature smaller than a distance threshold, is determined.
The offline action library includes a plurality of action features respectively corresponding to a plurality of image frames. The action feature corresponding to each image frame in the offline action library is determined according to that image frame and its subsequent image frames, all of which are actually acquired image frames.
For example, the embodiment may compute, one by one, the feature distance between each action feature in the offline action library and the first action feature, and then take an action feature in the offline action library whose feature distance from the first action feature is smaller than the distance threshold as the target action feature.
For example, the embodiment may first determine the image frames corresponding to the action features in the offline action library and query, among those image frames, the target image frames whose similarity with the current image frame is greater than a similarity threshold. After the target image frames are obtained, the feature distances between the action features corresponding to the target image frames and the first action feature can be calculated, and the action feature with the smallest feature distance from the first action feature is used as the target action feature.
For example, the feature distance may be computed as a Euclidean distance or a Manhattan distance, which is not limited by the present disclosure.
In operation S240, the key point information of the current motion pose is determined according to the second key point information of the motion pose of the target object in the image frame corresponding to the target motion feature.
In this embodiment, the second keypoint information of the motion posture in the image frame corresponding to the target motion feature may be used as the keypoint information of the current motion posture.
Alternatively, the embodiment may obtain the keypoint information of the current action gesture by comprehensively considering the first keypoint information and the second keypoint information. For example, the average of the second keypoint information and the first keypoint information may be used as the keypoint information of the current motion gesture.
In the technical solution of the embodiments of the present disclosure, since the action feature is determined by combining the current image frame with the predicted key point information of the action poses in subsequent image frames, matching it against the action features in the offline action library improves the accuracy of the matched target action feature. Therefore, when the key point information of the current action pose is determined according to the key point information of the action pose in the image frame corresponding to the matched target action feature, the key point information generated online is effectively adjusted, improving the accuracy of the finally determined key point information of the current action pose. Because offline action features derived from multiple image frames are matched, and the key points are determined according to the key point information corresponding to the matched action feature, the inaccuracy and incompleteness of key point information caused by occluded key points can be alleviated (occlusion of a key point usually differs across moments), which in turn alleviates mesh penetration in the generated virtual object model and the problem that some 3D key points cannot be touched during human-computer interaction.
In an embodiment, when the key point information of the current action pose is determined according to the second key point information, the first key point information may further be smoothed according to the second key point information, and the smoothed information used as the key point information of the current action pose. The smoothing may be implemented, for example, by interpolating between the first key point information and the second key point information; alternatively, a Good-Turing estimation method may be used, which is not limited by the present disclosure. Smoothing alleviates the mismatch between the initial or final state of the action captured online and the action corresponding to the action feature in the offline action library, improving the accuracy of the obtained key point information of the current pose: directly replacing the recognized first key point information with the second key point information would leave the second key point information mismatched with the current action pose whenever the starting or ending states differ, and smoothing reduces this mismatch.
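For instance, the interpolation-based smoothing could look like the following sketch, where alpha is an assumed blending parameter.

```python
import numpy as np

def smooth_keypoints(first_kp: np.ndarray, second_kp: np.ndarray,
                     alpha: float = 0.5) -> np.ndarray:
    """Linear interpolation between the online-recognized keypoints
    (first_kp) and the library keypoints (second_kp); alpha = 1.0 keeps
    the online estimate, alpha = 0.0 takes the library pose, and
    alpha = 0.5 reproduces the simple average mentioned earlier."""
    return alpha * first_kp + (1.0 - alpha) * second_kp
```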
FIG. 3 is a schematic diagram of the principle of determining feature distances between action features according to an embodiment of the present disclosure.
In an embodiment, an action feature may indicate key point information of a plurality of key points of the target object; for example, the action feature may be formed by concatenating the key point information of the plurality of key points. The key point information may include the aforementioned position, velocity, acceleration, and the like.
In this embodiment, the importance of the plurality of key points to the action may also be considered when determining the feature distance between action features; that is, each key point is assigned a different weight for different actions. For example, for a squat, the key point at the knee may be assigned a greater weight, while for an arms-open action it may be assigned a smaller weight. Accordingly, the weights assigned to the plurality of key points for the action captured online may be considered when determining the feature distance, so that the determined target action feature better matches the features of the key points that matter most to the action captured online, improving the accuracy of the determined target action feature.
In an embodiment, the action features each indicate key point information of a plurality of key points of the target object, and for different action features the key points corresponding to the indicated key point information are the same. After the first action feature is obtained, when determining the feature distance between the first action feature and an action feature in the offline action library, the feature distance between the key point information is calculated for each of the plurality of key points; the calculated per-key-point feature distances are then weighted according to the weights assigned to the key points, yielding the feature distance between the action features.
As shown in fig. 3, in embodiment 300, if the plurality of key points of the target object comprises the first to m-th key points, the key point information indicated by the determined first action feature 310 includes key point information a_1 311 through key point information a_m 312, and the key point information indicated by each offline action feature 320 in the offline action library includes key point information b_1 321 through key point information b_m 322. In this embodiment, the feature distance between key point information a_1 311 and key point information b_1 321 may be calculated first to obtain the first sub-feature distance 331, the feature distance between key point information a_2 and key point information b_2 calculated to obtain the second sub-feature distance, and so on until the feature distance between key point information a_m 312 and key point information b_m 322 is calculated to obtain the m-th sub-feature distance 332, yielding m sub-feature distances in total. The embodiment 300 may then weight the m sub-feature distances according to the weights (e.g., the first weight 341 through the m-th weight 342) of the m key points corresponding to the current action, so as to obtain the feature distance 350 between the offline action feature 320 and the first action feature 310.
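A small numpy sketch of this weighted distance follows, assuming each action feature is arranged as an (m, D) array with one row of key point information per key point.

```python
import numpy as np

def weighted_feature_distance(feat_a: np.ndarray, feat_b: np.ndarray,
                              weights: np.ndarray) -> float:
    """feat_a, feat_b: (m, D) arrays over the same m key points in the
    same order; weights: (m,) per-key-point weights for the current
    action. Sub-feature distances are Euclidean here, as one option."""
    sub = np.linalg.norm(feat_a - feat_b, axis=1)   # m sub-feature distances
    return float(np.dot(weights, sub))              # weighted combination
```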
Fig. 4 is an overall flowchart of a determination method of implementing the keypoint information according to an embodiment of the disclosure.
In one embodiment, the target action feature in the offline action library may be determined in real time according to the key point information perceived by the monocular motion capture system. When the target action feature is matched for the first time, the monocular motion capture system can be determined to have entered the matching state. The target action feature then continues to be determined in real time, and the first time the target action feature fails to match during a period in which the monocular motion capture system is in the matching state, the system is determined to have exited the matching state. After exiting the matching state, matching against the action features in the offline action library continues; that is, the matching operation is executed in a loop.
In an embodiment, in the cycle in which the monocular motion capture system exits the matching state, i.e., the first time the target action feature fails to match after the system has been in the matching state, the key point information of the current action pose may, for example, be further adjusted by combining the first key point information with the key point information of the action pose in the previous image frame of the current image frame. In this way, the determined key point information transitions smoothly to the perceived key point information, avoiding excessive differences between the key point information of the action poses in two consecutive image frames, so that the determined key point information better fits the actual situation and its precision is improved.
Based on this, as shown in fig. 4, in this embodiment 400, the method for determining the key point information may include operations S410 to S460, for example.
In operation S410, a motion pose of the target object in a subsequent image frame of the current image frame is predicted according to first key point information sensed by the monocular motion capture system in real time, and key point information of the predicted motion pose is obtained. The operation S410 is similar to the operation S210 described above, and is not described herein again.
In operation S420, a first motion characteristic is determined according to the first keypoint information and the keypoint information of the predicted motion pose. The operation S420 is similar to the operation S220 described above, and is not described herein again.
In operation S430, it is determined whether a target action feature exists in the offline action library. If so, operation S440 is performed, otherwise, operation S450 is performed. It is to be understood that the implementation principle of operation S430 is similar to the implementation principle of operation S230 described above, and if it is determined that there is an action feature in the offline action library whose feature distance from the first action feature is smaller than the predetermined distance, it is determined that the target action feature exists.
In operation S440, it is determined that the monocular motion capturing system is in the matching state, and the keypoint information of the current motion pose is determined according to the second keypoint information of the motion pose of the target object in the image frame corresponding to the target motion feature. The principle of determining the key point information of the current motion gesture in operation S440 is similar to the implementation principle of operation S240 described above, and is not described herein again. It will be appreciated that the monocular motion capture system may be determined to be in a matching state as long as the target motion characteristic is matched.
In operation S450, it is determined whether the monocular motion capture system is in a matching state. If yes, operation S460 is performed, otherwise, operation S410 is returned to. This operation S450 may be understood as determining whether there is an action feature in the offline action library whose feature distance from the second action feature corresponding to the previous image frame is less than a predetermined distance, i.e., whether there is an action feature in the offline action library that matches the second action feature.
In operation S460, it is determined that the monocular motion capturing system exits the matching state, and the keypoint information of the current motion pose is determined based on the first keypoint information and the third keypoint information of the motion pose in the previous image frame of the current image frame. It is understood that the third key point information is determined according to the key point information of the motion pose of the image frame corresponding to the target motion feature in the offline motion library in the previous cycle.
For example, the embodiment may use an average value of the first keypoint information and the third keypoint information as the keypoint information of the current motion pose. Or, the first key point information may be smoothed according to the third key point information, so as to obtain the key point information of the current motion posture.
It is understood that after operation S440 or S460 is performed, execution may return to operation S410 to enter the next loop. If the target action feature is not matched and the monocular motion capture system is not in the matching state, the embodiment may determine the first key point information to be the key point information of the current action pose in the current image frame.
In an embodiment, when the monocular motion capture system is in the matching state and execution returns to operation S410, the action poses in the subsequent image frames may be predicted according to the first key point information together with the fourth key point information of the action pose in the previous image frame. In other words, when the offline action library contains an action feature whose feature distance from the second action feature is smaller than the predetermined distance (the second action feature being the action feature corresponding to the previous image frame of the current image frame), the fourth key point information of the action pose in the previous image frame is incorporated when predicting the action poses in the subsequent image frames. This improves the accuracy of the predicted action poses and of their key point information.
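Putting operations S410-S460 together, a hedged sketch of the loop might look as follows; perceive, predict, and featurize are hypothetical callables standing in for the monocular capture system, the pose predictor, and the feature construction of operation S420.

```python
import numpy as np

def capture_loop(frames, perceive, predict, featurize,
                 library_feats, library_kps, threshold, alpha=0.5):
    """Illustrative S410-S460 loop (names are assumptions, not the
    patent's reference code). library_feats: (N, D) offline action
    features; library_kps: (N, K, 3) keypoints of the corresponding
    action poses."""
    matching = False
    third_kp = None   # keypoints determined in the previous cycle's match
    for frame in frames:
        first_kp = perceive(frame)                      # online 3D keypoints
        pred = predict(first_kp, third_kp if matching else None)  # S410
        feat = featurize(first_kp, pred)                          # S420
        dists = np.linalg.norm(library_feats - feat, axis=1)
        best = int(np.argmin(dists))
        if dists[best] < threshold:                     # S430 -> S440
            matching = True
            third_kp = alpha * first_kp + (1 - alpha) * library_kps[best]
            yield third_kp
        elif matching:                                  # S450 -> S460
            matching = False
            yield 0.5 * (first_kp + third_kp)
        else:                                           # no match at all
            yield first_kp
```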
In order to facilitate implementation of the method for determining the key point information in the embodiment of the present disclosure, the present disclosure further provides a method for generating an offline action library, which will be described in detail below with reference to fig. 5.
Fig. 5 is a flowchart illustrating a method for generating an offline action library according to an embodiment of the disclosure.
As shown in fig. 5, the method 500 for generating an offline action library according to this embodiment may include operations S510 to S530, for example.
In operation S510, for each image frame of the first predetermined number of image frames in the image sequence of the target action, first key point information of the action pose of the target object in that image frame and second key point information of the action pose of the target object in the subsequent image frames of that image frame are acquired.
For example, the embodiment may extract consecutive (n + 1) image frames from the image sequence, in which images are arranged in order of acquisition time, taking the front-most image frame of the (n + 1) frames as the image frame in question and the remaining frames as its subsequent image frames. The embodiment may extract a predetermined number of such image frame groups, each comprising (n + 1) consecutive image frames, where the predetermined number depends on the value of n: if the image sequence includes Q image frames, the first predetermined number refers to the first (Q - n) image frames, and the number of image frame groups obtained is likewise (Q - n).
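Reusing the motion_feature sketch from earlier, the window extraction could be sketched as follows; the array shapes and the action-name keying are illustrative assumptions.

```python
import numpy as np

def build_action_features(keypoints: np.ndarray, n: int) -> np.ndarray:
    """keypoints: (Q, K, 3) key point information for the Q frames of one
    target action's image sequence. Returns (Q - n) action features, one
    per window of (n + 1) consecutive frames."""
    q = keypoints.shape[0]
    feats = [motion_feature(keypoints[i:i + n + 1]) for i in range(q - n)]
    return np.stack(feats)

# The offline action library can then map action names to their features:
# library["squat"] = build_action_features(squat_keypoints, n=4)
```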
For example, the embodiment may use a key point detection algorithm to detect the first key point information and the second key point information. Alternatively, the key point information of the action pose in an image frame may be perceived by the motion capture system at the time the image frame is acquired.
It will be appreciated that the target motion may comprise, for example, a plurality of motion poses in succession, and the motion pose in the image frame may be one of the plurality of motion poses. The target movement may include any movement such as a squatting movement, a movement of opening both arms, and a movement of jumping, which is not limited in the present disclosure.
In operation S520, an action feature corresponding to each image frame is determined according to the first and second key point information. It is understood that in this embodiment, the principle of determining the motion feature corresponding to each image frame is similar to the principle of determining the first motion gesture in operation S220 described above, and is not described herein again.
In operation S530, a predetermined number of motion features respectively corresponding to the predetermined number of image frames are determined as motion features of a target motion in an offline motion library.
For each extracted group of (n + 1) image frames, this embodiment obtains one action feature, so (Q - n) action features are obtained in total. The embodiment may then store the (Q - n) action features in association with the name of the target action, yielding the action features of the target action in the offline action library.
In this embodiment, the action characteristics can be obtained for all of the target actions, and the action characteristics of the target actions can constitute an offline action library.
In one embodiment, a multi-view motion capture technique may be employed, for example, to capture the image sequence of the target action and to perceive the key point information of the action pose of the target object in each image frame of the sequence. In this way, all key points of the target object can be captured from all directions, so that the generated offline action library contains complete key point information of the target action. When the above method for determining key point information is executed based on such an offline action library, the determined key point information is more complete and is not affected by key points being occluded during online capture, which better alleviates mesh penetration in the generated virtual object model and the problem that some 3D key points cannot be touched during human-computer interaction.
For example, the multi-view motion capture technique may be an optical motion capture technique to further improve the accuracy of the captured keypoint information.
Based on the determination method of the key point information provided by the present disclosure, the present disclosure also provides a determination apparatus of key point information, which will be described in detail below with reference to fig. 6.
Fig. 6 is a block diagram of a structure of a determination apparatus of keypoint information according to an embodiment of the present disclosure.
As shown in fig. 6, the apparatus 600 for determining keypoint information of this embodiment may include a posture prediction module 610, a feature determination module 620, a target action determination module 630, and a first information determination module 640.
The pose prediction module 610 is configured to predict an action pose of the target object in a subsequent image frame of the current image frame according to first keypoint information obtained by identifying a current action pose of the target object in the current image frame, so as to obtain keypoint information of the predicted action pose. In an embodiment, the gesture prediction module 610 may be configured to perform the operation S210 described above, which is not described herein again.
The feature determining module 620 is configured to determine a first motion feature corresponding to the current image frame according to the first keypoint information and the keypoint information of the predicted motion pose. In an embodiment, the characteristic determining module 620 may be configured to perform the operation S220 described above, which is not described herein again.
Target action determination module 630 is used to determine a target action feature in the offline action library whose feature distance from the first action feature is less than a distance threshold. The offline action library comprises a plurality of action characteristics respectively corresponding to a plurality of image frames, and each action characteristic in the offline action library is determined according to the corresponding image frame and the subsequent image frame. In an embodiment, the target action determining module 630 may be configured to perform the operation S230 described above, which is not described herein again.
The first information determining module 640 is configured to determine, according to second keypoint information of an action pose of the target object in the image frame corresponding to the target action feature, keypoint information of a current action pose. In an embodiment, the first information determining module 640 may be configured to perform the operation S240 described above, which is not described herein again.
According to an embodiment of the present disclosure, the action feature indicates keypoint information of a plurality of keypoints of the target object. The apparatus 600 for determining keypoint information may further include a first distance determining module and a second distance determining module. The first distance determining module is used for determining sub-feature distances between the key point information of each key point indicated by each action feature in the off-line action library and the key point information of each key point indicated by the first action feature to obtain a plurality of sub-feature distances. The second distance determining module is used for weighting the plurality of sub-feature distances according to the weights of the plurality of key points corresponding to the current action to obtain the feature distance between each action feature and the first action feature.
According to an embodiment of the present disclosure, the above apparatus 600 for determining key point information may further include a second information determining module configured to determine, in response to the target action feature not existing in the offline action library while an action feature whose feature distance from a second action feature is smaller than a predetermined distance does exist, the key point information of the current action pose according to the first key point information and third key point information of the action pose in the previous image frame of the current image frame, where the second action feature is the action feature corresponding to the previous image frame.
According to an embodiment of the present disclosure, the gesture prediction module 610 is further specifically configured to predict the motion gesture of the target object in a subsequent image frame according to the first key point information and fourth key point information of the motion gesture in a previous image frame of the current image frame, in response to the presence of a motion feature having a feature distance from a second motion feature being smaller than a predetermined distance in the offline motion library, where the second motion feature is a motion feature corresponding to the previous image frame.
According to an embodiment of the present disclosure, the first information determining module 640 is further specifically configured to perform smoothing processing on the first key point information according to the second key point information to obtain the key point information of the current action posture.
Based on the method for generating the offline action library provided by the present disclosure, the present disclosure also provides a device for generating the offline action library, which will be described in detail below with reference to fig. 7.
Fig. 7 is a block diagram of a structure of an offline action library generation device according to an embodiment of the present disclosure.
As shown in fig. 7, the apparatus 700 for generating an offline action library according to this embodiment may include a key point information obtaining module 710, a first characteristic determining module 720, and a second characteristic determining module 730.
The keypoint information acquisition module 710 is configured to acquire, for each image frame of a predetermined number of previous image frames in the image sequence of the target motion, first keypoint information of a motion pose of the target object in each image frame, and second keypoint information of a motion pose of the target object in a subsequent image frame of each image frame. In an embodiment, the keypoint information obtaining module 710 may be configured to perform the operation S510 described above, which is not described herein again.
The first feature determining module 720 is configured to determine an action feature corresponding to each image frame according to the first key point information and the second key point information. In an embodiment, the first characteristic determining module 720 may be configured to perform the operation S520 described above, which is not described herein again.
The second feature determining module 730 is configured to determine a predetermined number of motion features respectively corresponding to the predetermined number of image frames as motion features of a target motion in the offline motion library. In an embodiment, the second characteristic determining module 730 may be configured to perform the operation S530 described above, which is not described herein again.
According to an embodiment of the present disclosure, the key point information obtaining module 710 may be specifically configured to: and capturing the image sequence by adopting a multi-view motion capture technology to obtain the key point information of the motion posture of the target object in each image frame in the image sequence.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations, necessary security measures are taken, and public order and good morals are not violated. In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is acquired or collected.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement the method of determining keypoint information or the method of generating an offline action library of embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the methods and processes described above, such as the method of determining key point information or the method of generating an offline action library. For example, in some embodiments, the method of determining key point information or the method of generating an offline action library may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the method of determining key point information or the method of generating an offline action library described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured in any other suitable manner (e.g., by means of firmware) to perform the method of determining key point information or the method of generating an offline action library.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and solves the defects of difficult management and weak service scalability in conventional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A method for determining key point information, comprising:
predicting an action pose of a target object in a subsequent image frame of a current image frame according to first key point information obtained by identifying a current action pose of the target object in the current image frame, to obtain key point information of the predicted action pose;
determining a first action feature corresponding to the current image frame according to the first key point information and the key point information of the predicted action pose;
determining, in an offline action library, a target action feature whose feature distance from the first action feature is smaller than a distance threshold, wherein the offline action library comprises a plurality of action features respectively corresponding to a plurality of image frames; and
determining the key point information of the current action pose according to second key point information of an action pose of the target object in the image frame corresponding to the target action feature,
wherein each action feature in the offline action library is determined according to the corresponding image frame and a subsequent image frame thereof.
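Purely as an illustrative sketch of the flow recited in claim 1, and not as part of the claims: the function below predicts the next pose, builds the first action feature, finds the nearest library feature, and derives the current key points from the matched frame. Every name, the callable pose predictor, the Euclidean distance, and the equal-weight blend are assumptions of this sketch.

```python
import numpy as np

def determine_keypoints(current_kpts: np.ndarray,
                        library: list[np.ndarray],
                        library_kpts: list[np.ndarray],
                        distance_threshold: float,
                        predict_pose) -> np.ndarray:
    """Sketch of the claimed flow: predict, featurize, match, refine.
    library[i] is an action feature; library_kpts[i] holds the second
    key point information of the frame corresponding to that feature."""
    predicted_kpts = predict_pose(current_kpts)              # predicted action pose
    first_feature = np.concatenate([current_kpts.ravel(),
                                    predicted_kpts.ravel()]) # first action feature
    distances = [np.linalg.norm(f - first_feature) for f in library]
    best = int(np.argmin(distances))
    if distances[best] < distance_threshold:                 # target action feature found
        second_kpts = library_kpts[best]
        # derive the current key points from the match, e.g., by the
        # smoothing of claim 5 (equal weights assumed here)
        return 0.5 * current_kpts + 0.5 * second_kpts
    return current_kpts   # no match below the threshold: keep the identified result
```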
2. The method of claim 1, wherein the action features indicate key point information of a plurality of key points of the target object, and the method further comprises:
determining a sub-feature distance between the key point information of each key point indicated by each action feature in the offline action library and the key point information of the corresponding key point indicated by the first action feature, to obtain a plurality of sub-feature distances; and
weighting the plurality of sub-feature distances according to weights of the plurality of key points corresponding to the current action, to obtain the feature distance between each action feature and the first action feature.
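One possible reading of claim 2 in code form follows; it is illustrative only. The per-key-point Euclidean sub-distance and the weighted sum are assumptions of this sketch, as is the example of emphasizing wrist key points for a hand-wave action.

```python
import numpy as np

def weighted_feature_distance(feature_kpts: np.ndarray,
                              first_feature_kpts: np.ndarray,
                              weights: np.ndarray) -> float:
    """Sketch of claim 2: one sub-feature distance per key point,
    weighted by per-key-point weights tied to the current action.

    feature_kpts, first_feature_kpts: arrays of shape (K, D) holding
    the key point information indicated by the two action features.
    weights: array of shape (K,), e.g., larger for the wrists when
    the current action is assumed to be a hand wave.
    """
    sub_distances = np.linalg.norm(feature_kpts - first_feature_kpts, axis=1)
    return float(np.dot(weights, sub_distances))
```

Weighting lets key points that are decisive for the current action dominate the match, while incidental key points contribute little to the feature distance.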
3. The method of claim 1, wherein the method further comprises:
in response to the target action feature not existing in the offline action library but an action feature whose feature distance from a second action feature is smaller than a predetermined distance existing in the offline action library, determining the key point information of the current action pose according to the first key point information and third key point information of an action pose in a previous image frame of the current image frame,
wherein the second action feature is an action feature corresponding to the previous image frame.
4. The method of claim 1, wherein the predicting the action pose of the target object in the subsequent image frame of the current image frame according to the first key point information obtained by identifying the current action pose of the target object in the current image frame, to obtain the key point information of the predicted action pose, comprises:
in response to an action feature whose feature distance from a second action feature is smaller than a predetermined distance existing in the offline action library, predicting the action pose of the target object in the subsequent image frame according to the first key point information and fourth key point information of an action pose in a previous image frame of the current image frame,
wherein the second action feature is an action feature corresponding to the previous image frame.
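Claim 4 predicts the subsequent pose from the current and previous frames; as an illustration only, one simple instantiation is a constant-velocity extrapolation. The linear model is an assumption of this sketch, not a limitation of the claim.

```python
import numpy as np

def predict_next_pose(first_kpts: np.ndarray,
                      fourth_kpts: np.ndarray) -> np.ndarray:
    """Constant-velocity extrapolation (assumed): continue the motion
    observed between the previous frame (fourth key point information)
    and the current frame (first key point information)."""
    return first_kpts + (first_kpts - fourth_kpts)
```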
5. The method according to claim 1, wherein the determining the key point information of the current action pose according to the second key point information of the action pose of the target object in the image frame corresponding to the target action feature comprises:
smoothing the first key point information according to the second key point information, to obtain the key point information of the current action pose.
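For illustration only, one simple form of the smoothing in claim 5 is a fixed-weight blend of the identified and library key points; the blend factor below is an assumption of this sketch.

```python
import numpy as np

def smooth_keypoints(first_kpts: np.ndarray,
                     second_kpts: np.ndarray,
                     alpha: float = 0.7) -> np.ndarray:
    """Blend the identified key points (first key point information)
    toward the library key points (second key point information);
    alpha controls how much of the identification result is kept."""
    return alpha * first_kpts + (1.0 - alpha) * second_kpts
```

A larger alpha preserves the raw identification result; a smaller alpha pulls the output toward the offline library and suppresses frame-to-frame jitter.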
6. A method for generating an offline action library, comprising:
for each image frame of a predetermined number of image frames in an image sequence of a target motion, acquiring first key point information of a motion pose of a target object in the image frame and second key point information of a motion pose of the target object in a subsequent image frame of the image frame;
determining an action feature corresponding to each image frame according to the first key point information and the second key point information; and
determining a predetermined number of action features respectively corresponding to the predetermined number of image frames as action features of the target motion in the offline action library.
7. The method of claim 6, wherein the acquiring the first key point information of the motion pose of the target object in each image frame and the second key point information of the motion pose of the target object in the subsequent image frame of each image frame comprises:
capturing the image sequence using a multi-view motion capture technique, to obtain key point information of the motion pose of the target object in each image frame of the image sequence.
8. An apparatus for determining key point information, comprising:
a pose prediction module configured to predict an action pose of a target object in a subsequent image frame of a current image frame according to first key point information obtained by identifying a current action pose of the target object in the current image frame, to obtain key point information of the predicted action pose;
a feature determining module configured to determine a first action feature corresponding to the current image frame according to the first key point information and the key point information of the predicted action pose;
a target action determining module configured to determine, in an offline action library, a target action feature whose feature distance from the first action feature is smaller than a distance threshold, wherein the offline action library comprises a plurality of action features respectively corresponding to a plurality of image frames; and
a first information determining module configured to determine the key point information of the current action pose according to second key point information of an action pose of the target object in the image frame corresponding to the target action feature,
wherein each action feature in the offline action library is determined according to the corresponding image frame and a subsequent image frame thereof.
9. The apparatus of claim 8, wherein the action features indicate key point information of a plurality of key points of the target object, and the apparatus further comprises:
a first distance determining module configured to determine a sub-feature distance between the key point information of each key point indicated by each action feature in the offline action library and the key point information of the corresponding key point indicated by the first action feature, to obtain a plurality of sub-feature distances; and
a second distance determining module configured to weight the plurality of sub-feature distances according to weights of the plurality of key points corresponding to the current action, to obtain the feature distance between each action feature and the first action feature.
10. The apparatus of claim 8, further comprising:
a second information determining module configured to, in response to the target action feature not existing in the offline action library but an action feature whose feature distance from a second action feature is smaller than a predetermined distance existing in the offline action library, determine the key point information of the current action pose according to the first key point information and third key point information of an action pose in a previous image frame of the current image frame,
wherein the second action feature is an action feature corresponding to the previous image frame.
11. The apparatus of claim 8, wherein the pose prediction module is configured to:
in response to an action feature whose feature distance from a second action feature is smaller than a predetermined distance existing in the offline action library, predict the action pose of the target object in the subsequent image frame according to the first key point information and fourth key point information of an action pose in a previous image frame of the current image frame,
wherein the second action feature is an action feature corresponding to the previous image frame.
12. The apparatus of claim 8, wherein the first information determining module is configured to:
smooth the first key point information according to the second key point information, to obtain the key point information of the current action pose.
13. An apparatus for generating an offline action library, comprising:
a key point information acquisition module configured to acquire, for each image frame of a predetermined number of image frames in an image sequence of a target motion, first key point information of a motion pose of a target object in the image frame and second key point information of a motion pose of the target object in a subsequent image frame of the image frame;
a first feature determining module configured to determine an action feature corresponding to each image frame according to the first key point information and the second key point information; and
a second feature determining module configured to determine a predetermined number of action features respectively corresponding to the predetermined number of image frames as action features of the target motion in the offline action library.
14. The apparatus of claim 13, wherein the key point information acquisition module is configured to:
capture the image sequence using a multi-view motion capture technique, to obtain key point information of the motion pose of the target object in each image frame of the image sequence.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method according to any one of claims 1-7.
17. A computer program product comprising a computer program/instructions stored on at least one of a readable storage medium and an electronic device, wherein the computer program/instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-7.