
CN113673436A - Behavior recognition and model training method and device - Google Patents

Behavior recognition and model training method and device Download PDF

Info

Publication number
CN113673436A
Authority
CN
China
Prior art keywords
video
target object
behavior
space
determining
Prior art date
Legal status
Withdrawn
Application number
CN202110969952.4A
Other languages
Chinese (zh)
Inventor
史磊
Current Assignee
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN202110969952.4A
Publication of CN113673436A
Legal status: Withdrawn

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The specification discloses a behavior recognition and model training method and device, relating to the field of unmanned driving. A video to be recognized, which comprises a motion video of a target object, is acquired and input into a recognition model trained in advance to obtain a video feature corresponding to the video to be recognized. A joint feature corresponding to each joint part of the target object is determined according to the video feature, a space-time diagram is determined according to the joint features corresponding to the joint parts, a behavior feature of the target object is determined according to the space-time diagram, and behavior recognition is performed on the target object according to the behavior feature. The space-time diagram represents the connection relations between the joint parts of the target object in space and the motion of each joint part over time, so that when the behavior feature is determined each joint part is not considered in isolation and the relations among the joint parts are taken into account, which improves the accuracy of behavior recognition.

Description

Behavior recognition and model training method and device
Technical Field
The specification relates to the field of unmanned driving, in particular to a method and a device for behavior recognition and model training.
Background
Currently, human behavior recognition technology can be applied in various fields. For example, in the field of unmanned driving, an unmanned device may determine its next driving strategy by recognizing the behavior of surrounding pedestrians (e.g., whether a pedestrian is walking). Of course, behavior recognition technology can also be applied in other fields.
In the prior art, the behavior of a person in a captured video can be recognized as follows: features of each part of the human body are determined from the video, and the features of each part are input into a deep learning model to obtain the recognized behavior. However, this approach does not consider the relationships between the parts of the human body (for example, there is a certain driving relationship between the wrist and the elbow), so its recognition accuracy is low.
Therefore, how to improve the accuracy of behavior recognition is an urgent problem to be solved.
Disclosure of Invention
The present specification provides a method and an apparatus for behavior recognition and model training, which partially solve the above problems in the prior art.
The technical scheme adopted by the specification is as follows:
the present specification provides a method of behavior recognition, comprising:
acquiring a video to be identified, wherein the video to be identified comprises a motion video of a target object;
inputting the video to be recognized into a recognition model trained in advance to obtain video characteristics corresponding to the video to be recognized;
determining joint features corresponding to each joint part of the target object according to the video features;
determining a space-time diagram according to the joint features corresponding to each joint part, wherein the space-time diagram is used for representing the connection relation between each joint part of the target object in space and the motion situation of each joint part of the target object in time;
and determining the behavior characteristics of the target object according to the space-time diagram, and performing behavior recognition on the target object according to the behavior characteristics.
Optionally, the identification model includes a video feature extraction layer and a prediction layer;
inputting the video to be recognized into a recognition model trained in advance to obtain video features corresponding to the video to be recognized, and the method specifically comprises the following steps:
inputting the video to be identified into the video feature extraction layer to obtain a feature matrix corresponding to the video to be identified, and using the feature matrix as the video feature corresponding to the video to be identified;
determining the behavior of the target object according to the behavior characteristics, specifically comprising:
and inputting the behavior characteristics into the prediction layer, and determining the behavior of the target object.
Optionally, each video frame in the video to be recognized corresponds to a limb structure diagram, the space-time diagram includes a limb structure diagram corresponding to each video frame in the video to be recognized, each node included in one limb structure diagram is used for representing a joint part of a target object, an edge between nodes in one limb structure diagram is used for representing a relationship between joint parts, and in the space-time diagram, nodes representing the same joint part in different limb structure diagrams are connected according to a time sequence of each video frame.
Optionally, determining, according to the video feature, a joint feature corresponding to each joint part of the target object, specifically including:
for each video frame, determining joint features of each joint part of the video frame from the video features according to the position of each joint part of the target object in the video frame;
determining a space-time diagram according to the joint features corresponding to each joint part, specifically comprising:
aiming at each video frame, determining the node initial characteristics of each node in the limb structure chart corresponding to the video frame according to the joint characteristics of each joint part under the video frame;
and determining the space-time diagram according to the node initial characteristics of each node in the limb structure diagram corresponding to each video frame.
Optionally, determining the behavior feature of the target object according to the space-time diagram specifically includes:
for any limb structure diagram contained in the space-time diagram, aiming at each node in the limb structure diagram, adjusting the node initial characteristics of the node according to the node initial characteristics of the node which has a connection relation with the node in the limb structure diagram to obtain the node characteristics of the node;
and determining the behavior characteristics of the target object according to the node characteristics of each node in the space-time diagram.
Optionally, determining the behavior feature of the target object according to the space-time diagram specifically includes:
aggregating the node characteristics of the nodes representing the same joint parts in each limb structure diagram in the space-time diagram to obtain the aggregated characteristics corresponding to each joint part;
and determining the behavior characteristics of the target object according to the polymerized characteristics corresponding to each joint part.
Optionally, the video to be identified is collected by the unmanned device, and the target object is a target object around the unmanned device;
the method further comprises the following steps:
and determining a control strategy aiming at the unmanned equipment according to the recognized behavior of the target object, and controlling the unmanned equipment.
The present specification provides a method of model training, comprising:
acquiring a first training sample, wherein the first training sample comprises a first sample video and annotation data corresponding to the first sample video;
inputting the first sample video into a preset identification model to obtain video characteristics corresponding to the first sample video;
determining joint features of each joint part of a target object in the first sample video according to the video features, determining a space-time diagram according to the joint features corresponding to each joint part, and determining behavior features of the target object in the first sample video according to the space-time diagram, wherein the space-time diagram is used for representing the connection relation between each joint part of the target object in space and the motion situation of each joint part of the target object in time;
and determining a recognition result representing the behavior of the target object according to the behavior characteristics, and training the recognition model by taking the minimized deviation between the recognition result and the labeled data corresponding to the first sample video as an optimization target.
Optionally, the recognition model includes a video feature extraction layer to be trained and a prediction layer to be trained, the video feature extraction layer is used for extracting video features, and the prediction layer is used for determining behaviors of the target object;
training the recognition model by taking the minimized deviation between the recognition result and the labeled data corresponding to the first sample video as an optimization target, specifically comprising:
and performing joint training on the video feature extraction layer and the prediction layer by taking the minimization of the deviation between the recognition result and the labeled data corresponding to the first sample video as an optimization target.
Optionally, the recognition model includes a pre-trained video feature extraction layer and a prediction layer to be trained;
pre-training the video feature extraction layer specifically comprises:
acquiring a second training sample, wherein the second training sample comprises a second sample video and marking data corresponding to the second sample video;
inputting the second sample video into a feature extraction layer to be trained to obtain a feature matrix corresponding to the second sample video, wherein each feature value in the feature matrix corresponds to one image area in the second sample video, and each video frame in the second sample video is composed of a plurality of image areas;
for each characteristic value in the characteristic matrix, inputting the characteristic value into the fully-connected layer corresponding to the image area of that characteristic value to obtain a recognition result corresponding to the characteristic value, wherein each characteristic value in one characteristic matrix corresponds to one fully-connected layer;
and training the video feature extraction layer by taking the minimization of the deviation between the recognition result corresponding to each characteristic value and the labeled data corresponding to the second sample video as an optimization target.
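As a concrete illustration of this pre-training scheme, the following is a hypothetical PyTorch sketch (the patent does not provide code). The names PerRegionPretrainHead and pretrain_step, the use of cross-entropy as the deviation measure, and the assumption that the extraction layer's output can be flattened into one characteristic value per image area are all illustrative assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

class PerRegionPretrainHead(nn.Module):
    """One fully-connected layer per characteristic value / image area (illustrative sketch)."""
    def __init__(self, num_regions, feat_dim, num_behaviors):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_behaviors) for _ in range(num_regions)]
        )

    def forward(self, feature_matrix):
        # feature_matrix: (num_regions, feat_dim), one characteristic value per image area
        return [head(feature_matrix[i]) for i, head in enumerate(self.heads)]

def pretrain_step(extractor, heads, second_sample_video, label, optimizer):
    """One pre-training step for the video feature extraction layer: every image area's
    recognition result is pushed towards the labeled data of the second sample video."""
    criterion = nn.CrossEntropyLoss()                     # assumed deviation measure
    feature_matrix = extractor(second_sample_video)       # assumed shape: (num_regions, feat_dim)
    results = heads(feature_matrix)                       # one recognition result per characteristic value
    loss = sum(criterion(r.unsqueeze(0), label) for r in results)
    optimizer.zero_grad()                                 # optimizer covers extractor and heads parameters
    loss.backward()
    optimizer.step()
    return loss.item()
```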
The present specification provides an apparatus for behavior recognition, comprising:
the system comprises an acquisition module, a recognition module and a display module, wherein the acquisition module is used for acquiring a video to be recognized, and the video to be recognized comprises a motion video of a target object acquired by unmanned equipment;
the input module is used for inputting the video to be recognized into a pre-trained recognition model to obtain the video characteristics corresponding to the video to be recognized;
the characteristic module is used for determining joint characteristics corresponding to each joint part of the target object according to the video characteristics;
the determining module is used for determining a space-time diagram according to the joint features corresponding to each joint part, wherein the space-time diagram is used for representing the connection relation between each joint part of the target object in space and the motion situation of each joint part of the target object in time;
and the identification module is used for determining the behavior characteristics of the target object according to the space-time diagram and carrying out behavior identification on the target object according to the behavior characteristics.
The present specification provides an apparatus for model training, comprising:
the acquisition module acquires a first training sample, wherein the first training sample comprises a first sample video and marking data corresponding to the first sample video;
the input module is used for inputting the first sample video into a preset identification model to obtain the video characteristics corresponding to the first sample video;
the determining module is used for determining joint features of each joint part of a target object in the first sample video according to the video features, determining a space-time diagram according to the joint features corresponding to each joint part, and determining behavior features of the target object in the first sample video according to the space-time diagram, wherein the space-time diagram is used for representing the connection relation between each joint part of the target object in space and the motion situation of each joint part of the target object in time;
and the training module is used for determining a recognition result representing the behavior of the target object according to the behavior characteristics, and training the recognition model by taking the minimized deviation between the recognition result and the labeled data corresponding to the first sample video as an optimization target.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described method of behavior recognition or method of model training.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the above behavior recognition method or model training method when executing the program.
The technical scheme adopted by the specification can achieve the following beneficial effects:
the method comprises the steps of obtaining a video to be recognized, wherein the video to be recognized comprises a motion video of a target object acquired by unmanned equipment, inputting the video to be recognized into a pre-trained recognition model to obtain video features corresponding to the video to be recognized, determining joint features corresponding to each joint part of the target object according to the video features, determining a space-time diagram according to the joint features corresponding to each joint part, determining behavior features of the target object according to the space-time diagram, wherein the space-time diagram is used for representing the connection relation between each joint part of the target object in space and the motion situation of each joint part of the target object in time, and performing behavior recognition on the target object.
It can be seen from the above method that, when determining the behavior features of the target object, the method uses a space-time diagram capable of representing the connection relations between the joint parts of the target object and the motion of each joint part. The joint parts are therefore not considered in isolation when the behavior features are determined; the relations among the joint parts are taken into account. Compared with the prior art, determining the behavior of the target object through such behavior features improves the accuracy of behavior recognition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and are incorporated in and constitute a part of this specification, illustrate embodiments of the specification and together with the description serve to explain the specification; they are not intended to limit the specification. In the drawings:
FIG. 1 is a schematic flow chart of a method for behavior recognition in the present specification;
FIG. 2A is a schematic representation of a limb structure provided herein;
FIG. 2B is a schematic diagram of the connection relationship between various limb structure diagrams provided in this specification;
FIG. 3 is a schematic flow chart of a method of model training in the present specification;
FIG. 4 is a schematic diagram of an apparatus for behavior recognition provided herein;
FIG. 5 is a schematic diagram of an apparatus for model training provided herein;
fig. 6 is a schematic diagram of an electronic device corresponding to fig. 1 or fig. 3 provided in the present specification.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more clear, the technical solutions of the present disclosure will be clearly and completely described below with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without any creative effort belong to the protection scope of the present specification.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a method for behavior recognition in this specification, which specifically includes the following steps:
s101: and acquiring a video to be identified, wherein the video to be identified comprises a motion video of the target object.
S102: and inputting the video to be recognized into a pre-trained recognition model to obtain the video characteristics corresponding to the video to be recognized.
In practical applications, behavior recognition can be applied to various fields, such as unmanned driving, human-computer interaction, and intelligent monitoring. For example, in the field of unmanned driving, an unmanned device may recognize the behaviors of target objects such as pedestrians and passengers through behavior recognition technology and determine its own control strategy according to the recognized behaviors, so accurately recognizing behaviors is critical for the unmanned device to implement an accurate control strategy.
Based on this, the unmanned device (or another device that needs behavior recognition; the execution subject is not limited here) may obtain a video to be recognized, where the video to be recognized may include a motion video of a target object. The unmanned device may then input the video to be recognized into a recognition model trained in advance to obtain a video feature corresponding to the video to be recognized. The video feature may be obtained through a video feature extraction layer included in the recognition model. The target object may be a living body having joints, such as a person or an animal. The video feature extraction layer may be a 3D Convolutional Neural Network (3D CNN), or another neural network used for extracting video features, which is not described in detail here. Alternatively, the video feature may be a set of image features corresponding to each video frame included in the video to be recognized.
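As an illustration of the step above, here is a minimal PyTorch sketch of a 3D-CNN video feature extraction layer. The patent does not fix a concrete architecture, so the layer sizes, tensor shapes, and the name VideoFeatureExtractor are assumptions made only for this example.

```python
import torch
import torch.nn as nn

class VideoFeatureExtractor(nn.Module):
    """Hypothetical 3D-CNN video feature extraction layer (a sketch, not the patent's model)."""
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.backbone = nn.Sequential(
            # 3D convolutions operate jointly over the time and space dimensions of the clip
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),          # downsample spatially, keep every frame
            nn.Conv3d(32, out_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, video):
        # video: (batch, channels, frames, height, width)
        # returns a feature matrix of shape (batch, out_channels, frames, h', w'),
        # i.e. one feature map per frame, each feature value covering one image area
        return self.backbone(video)

# usage: extract features from a 16-frame, 112x112 clip containing the target object
video_features = VideoFeatureExtractor()(torch.randn(1, 3, 16, 112, 112))
```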
S103: and determining a space-time diagram according to the joint features corresponding to each joint part, wherein the space-time diagram is used for representing the connection relation between each joint part of the target object in space and the motion situation of each joint part of the target object in time.
S104: and determining the behavior characteristics of the target object according to the space-time diagram, and performing behavior recognition on the target object according to the behavior characteristics.
After the video characteristics of the video to be recognized are determined, the unmanned device can determine the joint characteristics corresponding to each joint part of the target object according to the video characteristics, determine a space-time diagram according to the joint characteristics corresponding to each joint part, and determine the behavior characteristics of the target object according to the space-time diagram.
The space-time diagram mentioned here can be used to represent the connection relations between the joint parts of the target object in space and the motion of each joint part over time; that is, the space-time diagram can represent both the relations between the joints of the target object and the motion of those joints. Therefore, when determining the behavior feature of the target object, each joint part determined from the video to be recognized is not treated as an independent feature; the relations between the joint parts are also combined. As a result, when behavior recognition is performed through the behavior feature of the target object, the relations between the joint parts can be considered, and behavior recognition can be performed more accurately.
The concept of a limb structure diagram is introduced in this specification (a limb structure diagram has the structure of the topological graph shown in fig. 2A). Each node included in one limb structure diagram is used for representing one joint part of the target object, and the edges connecting different nodes in the limb structure diagram represent the connection relations between the joint parts. The space-time diagram may take several specific forms.
For example, the space-time diagram may include a single limb structure diagram, that is, one video to be recognized corresponds to one limb structure diagram, and each joint part corresponds to only one joint feature in the video to be recognized, where the joint feature represents the motion of that joint part over the whole video. Each joint part corresponds to one node in the space-time diagram, so the node initial feature of each node can be represented by the joint feature corresponding to that node. The space-time diagram therefore includes not only the nodes representing the joint parts but also the node initial features of those nodes, and a behavior feature capable of representing the action of the target object can be determined from it.
However, since the motion of the target object is temporally continuous, if the joint features are determined from the positions of the joint parts in each video frame, that is, if each joint part has one joint feature per video frame, a single limb structure diagram can only represent the spatial relations of the joint parts and can hardly represent the temporal change of the target object's motion. The space-time diagram may therefore also take the following form: it includes a limb structure diagram for each video frame of the video to be recognized, and, because the video consists of consecutive frames, nodes representing the same joint part in different limb structure diagrams are connected according to the time sequence of the video frames. A single limb structure diagram and the connection relation between limb structure diagrams are shown in fig. 2A and fig. 2B, respectively.
Fig. 2A is a schematic diagram of a limb structure provided in this specification.
Fig. 2B is a schematic diagram of a connection relationship between the structures of the various limbs provided in this specification.
In both fig. 2A and fig. 2B, the limb structure diagrams take a person as the target object as an example. The structure of one limb structure diagram represents the connection relations of the joint parts of the human body, so the structure of the human body can be represented directly by the limb structure diagram, with each node representing one joint part. Fig. 2B shows an example of connecting two limb structure diagrams; to distinguish them, the nodes of the two diagrams are drawn in different colors, and it can be seen that the nodes representing the same joint part in the two diagrams are connected.
In order to show the change of each joint part of the target object during an action through the limb structure diagrams, the joint features of the joint parts in each video frame need to be mapped to the nodes of the corresponding limb structure diagram. Specifically, for each video frame, the joint features of the joint parts in that frame are determined from the video features according to the positions of the joint parts of the target object in the frame; the node initial features of the nodes in the limb structure diagram corresponding to the frame are determined according to those joint features; and the space-time diagram is determined according to the node initial features of the nodes in the limb structure diagrams corresponding to all the video frames.
That is, the space-time diagram may include not only the limb structure diagram corresponding to each video frame but also the node initial features given by the joint features of the joint parts. In other words, the space-time diagram contains both the limb structure diagrams indicating the relations between the joint parts of the target object and the motion characteristics of the target object over continuous time as captured in the video to be recognized. The behavior features determined through the space-time diagram therefore relate the joint parts to one another and extract the characteristics of the target object's action in the video, so the behavior of the target object can be determined accurately from these behavior features.
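The following sketch makes the two steps above concrete: joint features are read out of the video feature matrix at the joint positions of each frame, and the space-time diagram is assembled as node initial features plus an adjacency matrix. The skeleton edge list, the assumption that joint positions come from an external pose estimator and are already scaled to the feature-map resolution, and the function names are all illustrative; the patent only requires spatial edges along the limb structure and temporal edges linking the same joint in consecutive frames.

```python
import torch

# assumed skeleton: pairs of joint indices that are physically connected
# (the edges of one limb structure diagram)
SKELETON_EDGES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5), (1, 6), (6, 7)]

def extract_joint_features(video_features, joint_positions):
    """video_features: (channels, frames, H, W) feature matrix from the extraction layer.
    joint_positions: (frames, joints, 2) joint coordinates per frame, assumed to come
    from an external pose estimator and to be scaled to the feature-map resolution.
    Returns per-frame joint features of shape (frames, joints, channels)."""
    C, T, H, W = video_features.shape
    _, V, _ = joint_positions.shape
    feats = torch.zeros(T, V, C)
    for t in range(T):
        for v in range(V):
            x, y = joint_positions[t, v]
            # sample the feature map of frame t at this joint part's position
            feats[t, v] = video_features[:, t, int(y), int(x)]
    return feats

def build_space_time_graph(joint_features):
    """Builds the space-time diagram: node initial features plus an adjacency matrix with
    spatial (limb) edges inside each frame and temporal edges between the same joint in
    consecutive frames."""
    T, V, C = joint_features.shape
    n = T * V
    adjacency = torch.zeros(n, n)
    for t in range(T):
        for i, j in SKELETON_EDGES:                                   # spatial edges of frame t
            adjacency[t * V + i, t * V + j] = adjacency[t * V + j, t * V + i] = 1
        if t + 1 < T:
            for v in range(V):                                        # temporal edges t -> t+1
                adjacency[t * V + v, (t + 1) * V + v] = adjacency[(t + 1) * V + v, t * V + v] = 1
    node_init = joint_features.reshape(n, C)                          # joint feature = node initial feature
    return node_init, adjacency
```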
Since the nodes in a limb structure diagram represent joint parts, and the connection relations between the joint parts need to be introduced through the limb structure diagrams in the space-time diagram, the behavior features are determined as follows: for any limb structure diagram included in the space-time diagram and for each node in that diagram, the node initial feature of the node is adjusted according to the node initial features of the nodes that have a connection relation with it in the diagram, yielding the node feature of the node; the behavior features of the target object are then determined according to the node features of all the nodes in the space-time diagram.
When determining the node feature of a node, the node itself and the nodes having a connection relation with it may all be taken as target nodes, and the node feature is determined from the node initial features of these target nodes. The target nodes can be divided into three categories: the node itself, nodes closer to a central point (such as the central point shown in fig. 2A) than the node, and nodes farther from the central point than the node. Different categories of target nodes correspond to different network parameters. The node feature of the node is determined from the network parameter corresponding to each target node and the node initial feature of that target node; it can be understood that, when the node initial features of the target nodes are aggregated, the weight of a target node's initial feature may be the network parameter corresponding to that target node.
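A sketch of this neighbor-weighted adjustment is given below. The three-way split of target nodes (the node itself, neighbors closer to the central point, neighbors farther from it) follows the text; parameterizing each category as a linear projection and normalizing by the number of target nodes are assumptions made for the example.

```python
import torch
import torch.nn as nn

class PartitionedNodeUpdate(nn.Module):
    """Adjusts each node's initial feature from its target nodes, grouped into three
    categories with separate network parameters (illustrative sketch)."""
    def __init__(self, feat_dim, hops_to_center):
        super().__init__()
        self.hops = hops_to_center                              # per-joint graph distance to the central point
        self.proj = nn.ModuleList([nn.Linear(feat_dim, feat_dim) for _ in range(3)])

    def forward(self, node_init, neighbors):
        # node_init: (joints, feat_dim); neighbors: dict joint -> list of connected joints
        out = torch.zeros_like(node_init)
        for v in range(node_init.shape[0]):
            targets = [v] + list(neighbors.get(v, []))          # the node itself is also a target node
            for u in targets:
                if u == v:
                    k = 0                                       # category 0: the node itself
                elif self.hops[u] < self.hops[v]:
                    k = 1                                       # category 1: closer to the central point
                else:
                    k = 2                                       # category 2: farther from the central point
                out[v] = out[v] + self.proj[k](node_init[u]) / len(targets)
        return out                                              # adjusted node features
```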
In this method, the behavior features that represent the behavior of the target object in the video to be recognized are determined through the space-time diagram. To do so, the node features (or node initial features) of the nodes representing the same joint part in the space-time diagram can be aggregated to obtain an aggregated feature for each joint part, and the behavior features of the target object are then determined according to the aggregated features of all the joint parts.
The node features can be aggregated in various ways; for example, the mean or the maximum of the node features may be taken to obtain the aggregated feature. The behavior feature of the target object can likewise be determined from the per-joint aggregated features in various ways: for example, the average of the aggregated feature vectors may be taken; or a weight may be determined for each joint part and the aggregated feature vectors weighted and summed with those weights to obtain the behavior feature of the target object.
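A short sketch of this aggregation step follows: node features of the same joint part are pooled over time (mean pooling here; max pooling would be equally valid per the text), and the per-joint aggregated features are fused into one behavior feature by a weighted sum. Making the joint weights learnable is an assumption for the example; a plain average would also match the text.

```python
import torch
import torch.nn as nn

class BehaviorFeatureHead(nn.Module):
    """Aggregates node features per joint part over time and fuses them into a single
    behavior feature vector (illustrative sketch)."""
    def __init__(self, num_joints):
        super().__init__()
        # one weight per joint part, assumed learnable for this example
        self.joint_weights = nn.Parameter(torch.ones(num_joints))

    def forward(self, node_feats):
        # node_feats: (frames, joints, feat_dim), node features from the space-time diagram
        aggregated = node_feats.mean(dim=0)                       # (joints, feat_dim): aggregated feature per joint
        weights = torch.softmax(self.joint_weights, dim=0)        # normalized joint weights
        behavior_feature = (weights.unsqueeze(-1) * aggregated).sum(dim=0)
        return behavior_feature                                   # (feat_dim,)
```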
It should be noted that the method of this specification may be applied to behavior recognition of a target object by an unmanned device. In that case, the video to be recognized may be a video acquired by the unmanned device, the target object may be an object around the unmanned device, and the unmanned device may determine a control strategy for itself according to the behavior of the target object so as to control the unmanned device.
For example, if the target object is a pedestrian and the unmanned device determines that the pedestrian is hailing the vehicle, the unmanned device may stop before reaching the pedestrian's position; as another example, if the unmanned device determines that the target object is crossing the road, it may begin to decelerate. Of course, the unmanned device may also recognize the behavior of passengers in the vehicle in addition to the behavior of surrounding pedestrians.
The behavior recognition method of this specification can be applied not only to behavior recognition by an unmanned device but also to other business scenarios. For example, in a human-computer interaction scenario, a device such as a mobile phone, a wearable device, or a tablet may use this method to recognize the behavior of the user operating the device and determine how to carry out the human-computer interaction according to that behavior. The execution subject of the method may therefore also be a mobile phone, a wearable device, a tablet, or the like.
In this specification, the recognition model may include different network layers. Specifically, the recognition model may include a video feature extraction layer and a prediction layer. The video feature extraction layer is configured to determine the video features corresponding to the video to be recognized, and the prediction layer is configured to determine the behavior features and the behavior of the target object; that is, the space-time diagram may be input to the prediction layer, which obtains the behavior features and determines the behavior of the target object from them. The prediction layer mentioned in the following content may include a first network layer and a second network layer, so that the space-time diagram can be input to the second network layer to obtain the behavior features, and the behavior features can then be input to the first network layer to obtain the recognized behavior of the target object.
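Putting the pieces together, the recognition model described above could be organized as in the sketch below, reusing VideoFeatureExtractor, extract_joint_features, build_space_time_graph, and BehaviorFeatureHead from the earlier sketches. The wiring, the omission of the node-feature adjustment step, and the classification-head size are assumptions; the patent only fixes the split into a video feature extraction layer and a prediction layer whose second network layer yields the behavior features and whose first network layer outputs the behavior.

```python
import torch
import torch.nn as nn

class RecognitionModel(nn.Module):
    """Sketch of the recognition model: video feature extraction layer plus prediction layer
    (second network layer -> behavior features, first network layer -> behavior)."""
    def __init__(self, num_behaviors, feat_dim=64, num_joints=8):
        super().__init__()
        self.video_feature_layer = VideoFeatureExtractor(out_channels=feat_dim)
        self.second_network_layer = BehaviorFeatureHead(num_joints)    # space-time diagram -> behavior feature
        self.first_network_layer = nn.Linear(feat_dim, num_behaviors)  # behavior feature -> behavior scores

    def forward(self, video, joint_positions):
        # video: (channels, frames, height, width) clip of the target object (unbatched for simplicity)
        video_features = self.video_feature_layer(video.unsqueeze(0)).squeeze(0)
        joint_feats = extract_joint_features(video_features, joint_positions)  # (frames, joints, feat_dim)
        node_init, adjacency = build_space_time_graph(joint_feats)
        # the neighbor-based node-feature adjustment sketched earlier would be applied here;
        # it is omitted for brevity, so the node initial features are used directly
        node_feats = node_init.reshape(joint_feats.shape)
        behavior_feature = self.second_network_layer(node_feats)
        return self.first_network_layer(behavior_feature)
```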
The above describes the method in terms of application of a recognition model that needs to be trained in advance, and the following describes the training process of the recognition model, as shown in fig. 3.
Fig. 3 is a schematic flowchart of a method for training a model provided in this specification, and specifically includes the following steps:
s301: obtaining a first training sample, wherein the first training sample comprises a first sample video and annotation data corresponding to the first sample video.
S302: and inputting the first sample video into a preset identification model to obtain the video characteristics corresponding to the first sample video.
S303: according to the video features, determining joint features of each joint part of the target object in the first sample video, determining a space-time diagram according to the joint features corresponding to each joint part, and determining behavior features of the target object in the first sample video according to the space-time diagram, wherein the space-time diagram is used for representing the connection relation between each joint part of the target object in space and the motion situation of each joint part of the target object in time.
S304: and determining a recognition result representing the behavior of the target object according to the behavior characteristics, and training the recognition model by taking the minimized deviation between the recognition result and the labeled data corresponding to the first sample video as an optimization target.
For convenience of description, the method of training the model in this specification will be described with reference to the server as the execution subject.
First, a server can obtain a first training sample, where the first training sample includes a first sample video and annotation data corresponding to the first sample video. The server can then input the first sample video into a preset recognition model to obtain the video features corresponding to the first sample video, determine the joint features of each joint part of the target object in the first sample video according to the video features, determine a space-time diagram according to the joint features corresponding to the joint parts, and determine the behavior features of the target object in the first sample video according to the space-time diagram, where the space-time diagram is used for representing the connection relation between the joint parts of the target object in space and the motion of each joint part over time. Finally, a recognition result representing the behavior of the target object is determined according to the behavior features, and the recognition model is trained by taking the minimization of the deviation between the recognition result and the annotation data corresponding to the first sample video as the optimization target.
It should be noted that the recognition model may include network layers with different functions that jointly implement the model. Specifically, the recognition model may include a video feature extraction layer and a prediction layer, where the video feature extraction layer is used to extract video features and the prediction layer is used to determine the behavior of the target object. The prediction layer may in turn include, in addition to a network layer for determining the behavior of the target object (which may be referred to as the first network layer), a network layer for determining the behavior features from the space-time diagram (which may be referred to as the second network layer). When the recognition model is trained, the video feature extraction layer and the prediction layer may be trained jointly.
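A minimal sketch of the joint training loop described above is given below, assuming the RecognitionModel from the earlier sketch, cross-entropy as the deviation measure between the recognition result and the annotation data, and a data loader yielding (first sample video, joint positions, behavior label) triples; none of these choices are prescribed by the patent.

```python
import torch
import torch.nn as nn

def train_recognition_model(model, data_loader, epochs=10, lr=1e-4):
    """Jointly trains the video feature extraction layer and the prediction layer by
    minimizing the deviation between the recognition result and the labeled data."""
    criterion = nn.CrossEntropyLoss()                              # assumed deviation measure
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)        # updates both layers jointly
    model.train()
    for _ in range(epochs):
        for first_sample_video, joint_positions, label in data_loader:
            recognition_result = model(first_sample_video, joint_positions)
            loss = criterion(recognition_result.unsqueeze(0), label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```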
It can be seen from the above method that, when determining the behavior features of the target object, the method combines a space-time diagram capable of representing the connection relations between the joint parts of the target object and the motion of each joint part with the joint features corresponding to the joint parts. The joint parts are therefore not considered in isolation when the behavior features are determined, and the relations between them are taken into account. Compared with the prior art, determining the behavior of the target object through such behavior features improves the accuracy of behavior recognition.
Based on the same idea, the present specification also provides a corresponding behavior recognition apparatus and model training apparatus, as shown in fig. 4 and 5.
Fig. 4 is a schematic diagram of a behavior recognition apparatus provided in this specification, which specifically includes:
the acquiring module 401 is configured to acquire a video to be identified, where the video to be identified includes a motion video of a target object;
an input module 402, configured to input the video to be recognized into a recognition model trained in advance, to obtain a video feature corresponding to the video to be recognized;
a feature module 403, configured to determine, according to the video features, joint features corresponding to each joint part of the target object;
a determining module 404, configured to determine a space-time diagram according to the joint features corresponding to each joint part, where the space-time diagram is used to represent a connection relationship between each joint part of the target object in space and a motion situation of each joint part of the target object in time;
and the identification module 405 is configured to determine a behavior feature of the target object according to the space-time diagram, and perform behavior identification on the target object according to the behavior feature.
Optionally, the identification model includes a video feature extraction layer and a prediction layer;
the input module 402 is specifically configured to input the video to be identified into the video feature extraction layer, to obtain a feature matrix corresponding to the video to be identified, and use the feature matrix as a video feature corresponding to the video to be identified; the recognition module 405 is specifically configured to input the behavior feature into the prediction layer, and determine the behavior of the target object.
Optionally, each video frame in the video to be recognized corresponds to a limb structure diagram, the space-time diagram includes a limb structure diagram corresponding to each video frame in the video to be recognized, each node included in one limb structure diagram is used for representing a joint part of a target object, edges between the nodes are used for representing a relationship between the joint parts, and in the space-time diagram, the nodes representing the same joint part in different limb structure diagrams are connected according to a time sequence of each video frame.
Optionally, the feature module 403 is specifically configured to, for each video frame, determine joint features of joint parts of the target object from the video features according to positions of the joint parts in the video frame; the determining module 404 is specifically configured to, for each video frame, determine, according to joint features of each joint part in the video frame, node initial features of each node in the limb structure diagram corresponding to the video frame; and determining the space-time diagram according to the node initial characteristics of each node in the limb structure diagram corresponding to each video frame.
Optionally, the identifying module 405 is specifically configured to, for any one of the limb structure diagrams included in the space-time diagram, adjust, for each node in the limb structure diagram, the node initial feature of the node according to the node initial feature of the node in the limb structure diagram, where the node has a connection relationship with the node, to obtain the node feature of the node; and determining the behavior characteristics of the target object according to the node characteristics of each node in the space-time diagram.
Optionally, the identification module 405 is specifically configured to aggregate node features of nodes representing the same joint location in each limb structure diagram in the space-time diagram, so as to obtain aggregated features corresponding to each joint location; and determining the behavior characteristics of the target object according to the polymerized characteristics corresponding to each joint part.
Optionally, the video to be identified is collected by the unmanned device, and the target object is a target object around the unmanned device;
the device further comprises: and the control module 406 is configured to determine a control strategy for the unmanned device according to the identified behavior of the target object, and control the unmanned device.
Fig. 5 is a schematic diagram of a model training apparatus provided in this specification, which specifically includes:
an obtaining module 501, configured to obtain a first training sample, where the first training sample includes a first sample video and annotation data corresponding to the first sample video;
an input module 502, which inputs the first sample video into a preset recognition model to obtain a video feature corresponding to the first sample video;
a determining module 503, configured to determine, according to the video features, joint features of each joint part of the target object in the first sample video, determine a space-time diagram according to the joint features corresponding to each joint part, and determine, according to the space-time diagram, behavior features of the target object in the first sample video, where the space-time diagram is used to represent a connection relationship between each joint part of the target object in space and a motion situation of each joint part of the target object in time;
a training module 504, configured to determine, according to the behavior feature, a recognition result indicating a behavior of the target object, and train the recognition model with minimizing a deviation between the recognition result and the labeled data corresponding to the first sample video as an optimization target.
Optionally, the recognition model includes a video feature extraction layer to be trained and a prediction layer to be trained, the video feature extraction layer is used for extracting video features, and the prediction layer is used for determining behaviors of the target object; the training module 504 is specifically configured to perform joint training on the video feature extraction layer and the prediction layer with a goal of minimizing a deviation between the recognition result and the labeled data corresponding to the first sample video as an optimization goal.
The present specification also provides a computer-readable storage medium having stored thereon a computer program, the computer program being operable to perform the method of behavior recognition and the method of model training provided in fig. 1 or 3 above.
This specification also provides a schematic block diagram of the electronic device shown in fig. 6. As shown in fig. 6, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, but may also include hardware required for other services. The processor reads a corresponding computer program from the non-volatile memory into the memory and then runs the computer program to implement the behavior recognition method and the model training method described in fig. 1 or fig. 3. Of course, besides the software implementation, the present specification does not exclude other implementations, such as logic devices or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may be hardware or logic devices.
In the 1990s, improvements to a technology could be clearly distinguished as improvements in hardware (for example, improvements to circuit structures such as diodes, transistors, and switches) or improvements in software (improvements to a method flow). However, as technology develops, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by a user through programming the device. A designer programs a digital system "integrated" onto a PLD, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, such programming is now mostly implemented with "logic compiler" software, which is similar to a software compiler used in program development, and the original code to be compiled must be written in a particular programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), of which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It should also be clear to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained simply by programming the method flow into an integrated circuit using the hardware description languages above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the devices included in it for realizing various functions may also be regarded as structures within the hardware component; or even the devices for realizing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present specification may be provided as a method, system, or computer program product. Accordingly, the specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the specification may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (13)

1. A method of behavior recognition, comprising:
acquiring a video to be recognized, wherein the video to be recognized comprises a motion video of a target object;
inputting the video to be recognized into a recognition model trained in advance to obtain video features corresponding to the video to be recognized;
determining joint features corresponding to each joint part of the target object according to the video features;
determining a space-time diagram according to the joint features corresponding to each joint part, wherein the space-time diagram is used for representing the connection relation between each joint part of the target object in space and the motion situation of each joint part of the target object in time;
and determining the behavior features of the target object according to the space-time diagram, and performing behavior recognition on the target object according to the behavior features.
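As a non-limiting illustration of the flow in claim 1, the steps could be chained as below. The names backbone, locate_joints, build_space_time_graph and behavior_head are hypothetical placeholders for the recognition model's parts, not names taken from the application.

import torch

def recognize_behavior(video, backbone, locate_joints, build_space_time_graph, behavior_head):
    # video: (T, C, H, W) tensor holding the motion video of the target object
    video_features = backbone(video)                                    # video features from the recognition model
    joint_features = locate_joints(video_features)                      # one feature per joint part per frame
    node_features, adjacency = build_space_time_graph(joint_features)   # space-time graph
    behavior_features = node_features.mean(dim=(0, 1))                  # pool the graph into behavior features
    return behavior_head(behavior_features)                             # recognized behavior of the target object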
2. The method of claim 1, wherein the recognition model comprises a video feature extraction layer and a prediction layer;
inputting the video to be recognized into the recognition model trained in advance to obtain the video features corresponding to the video to be recognized, specifically comprising:
inputting the video to be recognized into the video feature extraction layer to obtain a feature matrix corresponding to the video to be recognized, and using the feature matrix as the video features corresponding to the video to be recognized;
performing behavior recognition on the target object according to the behavior features, specifically comprising:
and inputting the behavior features into the prediction layer, and determining the behavior of the target object.
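A minimal sketch of the two-part model in claim 2, assuming a 3D-convolution video feature extraction layer and a linear prediction layer; the layer types and sizes are illustrative assumptions only.

import torch
import torch.nn as nn

class RecognitionModel(nn.Module):
    def __init__(self, num_behaviors=10, feat_dim=64):
        super().__init__()
        # video feature extraction layer: turns a clip into a feature matrix
        self.video_feature_layer = nn.Sequential(
            nn.Conv3d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the frame axis, pool away space
        )
        # prediction layer: maps behavior features to a behavior class
        self.prediction_layer = nn.Linear(feat_dim, num_behaviors)

    def extract_video_features(self, clip):
        # clip: (batch, 3, T, H, W) -> feature matrix of shape (batch, T, feat_dim)
        feat = self.video_feature_layer(clip)
        return feat.squeeze(-1).squeeze(-1).transpose(1, 2)

    def predict(self, behavior_features):
        # behavior_features: (batch, feat_dim) -> behavior logits
        return self.prediction_layer(behavior_features)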
3. The method according to claim 1, wherein each video frame in the video to be recognized corresponds to a limb structure diagram, the space-time diagram includes a limb structure diagram corresponding to each video frame in the video to be recognized, each node included in one limb structure diagram is used for representing a joint part of a target object, edges between nodes in one limb structure diagram are used for representing a relationship between joint parts, and in the space-time diagram, nodes representing the same joint part in different limb structure diagrams are connected according to a time sequence of each video frame.
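The graph of claim 3 can be encoded as a single adjacency matrix over T frames times J joints: edges inside a frame follow a fixed limb skeleton, and each joint is additionally linked to the same joint in the neighbouring frame. The five-joint skeleton below is only an example.

import numpy as np

def build_space_time_adjacency(num_frames, num_joints, limb_edges):
    # node index = frame * num_joints + joint
    n = num_frames * num_joints
    adj = np.zeros((n, n), dtype=np.float32)
    for t in range(num_frames):
        base = t * num_joints
        # spatial edges: connection relation between joint parts within one frame
        for i, j in limb_edges:
            adj[base + i, base + j] = adj[base + j, base + i] = 1.0
        # temporal edges: the same joint connected across consecutive frames
        if t + 1 < num_frames:
            for j in range(num_joints):
                adj[base + j, base + num_joints + j] = 1.0
                adj[base + num_joints + j, base + j] = 1.0
    return adj

# example: head, neck, torso and two arms, over four frames
adjacency = build_space_time_adjacency(num_frames=4, num_joints=5,
                                        limb_edges=[(0, 1), (1, 2), (1, 3), (1, 4)])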
4. The method according to claim 3, wherein determining the joint features corresponding to each joint part of the target object according to the video features specifically comprises:
for each video frame, determining, from the video features, the joint features of each joint part under the video frame according to the position of each joint part of the target object in the video frame;
determining a space-time diagram according to the joint features corresponding to each joint part, specifically comprising:
aiming at each video frame, determining the node initial features of each node in the limb structure diagram corresponding to the video frame according to the joint features of each joint part under the video frame;
and determining the space-time diagram according to the node initial features of each node in the limb structure diagram corresponding to each video frame.
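The two steps of claim 4 might look as follows: read the joint feature out of the video features at each joint's position in each frame, then use it as that node's initial feature. The joint_positions input (pixel coordinates per frame) is assumed to come from a separate pose detector and is not part of the claim.

import torch

def node_initial_features(video_features, joint_positions, stride=1):
    # video_features: (T, C, H, W) feature maps, one per video frame
    # joint_positions: (T, J, 2) pixel (x, y) of each joint part of the target object
    T, C, H, W = video_features.shape
    J = joint_positions.shape[1]
    nodes = torch.zeros(T, J, C)
    for t in range(T):
        for j in range(J):
            x, y = joint_positions[t, j]
            # map the joint position onto the feature map and read the feature there
            col = min(max(int(x) // stride, 0), W - 1)
            row = min(max(int(y) // stride, 0), H - 1)
            nodes[t, j] = video_features[t, :, row, col]
    return nodes  # node initial features of the limb structure diagrams, frame by frame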
5. The method according to claim 4, wherein determining the behavior features of the target object according to the space-time diagram specifically comprises:
for each limb structure diagram contained in the space-time diagram and each node in the limb structure diagram, adjusting the node initial features of the node according to the node initial features of the nodes which have a connection relation with the node in the limb structure diagram, to obtain the node features of the node;
and determining the behavior features of the target object according to the node features of each node in the space-time diagram.
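The adjustment in claim 5 amounts to one round of neighbourhood aggregation: each node's initial feature is updated with the initial features of the nodes connected to it, and the updated node features are then pooled into the behavior features. The unweighted mean used below is an assumption made only to keep the sketch short.

import numpy as np

def adjust_node_features(init_feats, adjacency):
    # init_feats: (N, C) node initial features, adjacency: (N, N) space-time graph edges
    adj_self = adjacency + np.eye(adjacency.shape[0], dtype=adjacency.dtype)
    degree = adj_self.sum(axis=1, keepdims=True)
    # each node feature becomes the mean of itself and its connected nodes
    return (adj_self @ init_feats) / degree

def behavior_features(node_feats):
    # pool every node of the space-time graph into one behavior feature vector
    return node_feats.mean(axis=0)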
6. The method according to claim 4 or 5, wherein determining the behavior features of the target object according to the space-time diagram specifically comprises:
aggregating the node features of the nodes representing the same joint part in each limb structure diagram in the space-time diagram to obtain the aggregated feature corresponding to each joint part;
and determining the behavior features of the target object according to the aggregated features corresponding to each joint part.
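With the node features arranged as (frames, joints, channels), the aggregation of claim 6 reduces over the frame axis, giving one aggregated feature per joint part; concatenating these per-joint features is one possible way, assumed here, to form the behavior features.

import numpy as np

def aggregate_per_joint(node_feats):
    # node_feats: (T, J, C) -> (J, C) aggregated feature for each joint part
    return node_feats.mean(axis=0)

def behavior_features_from_joints(per_joint_feats):
    # concatenate the aggregated joint features into a single behavior feature vector
    return per_joint_feats.reshape(-1)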
7. The method of claim 1, wherein the video to be recognized is captured by an unmanned device, and the target object is a target object around the unmanned device;
the method further comprises the following steps:
and determining a control strategy for the unmanned device according to the recognized behavior of the target object, and controlling the unmanned device.
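Claim 7 leaves the control strategy open; the mapping below from a recognized pedestrian behavior to a speed command is a made-up example of such a strategy, not one specified by the application.

def control_strategy(recognized_behavior):
    # hypothetical mapping from the recognized behavior to an unmanned-device command
    if recognized_behavior == "crossing":
        return {"action": "brake", "target_speed": 0.0}
    if recognized_behavior == "waving":
        return {"action": "slow_down", "target_speed": 5.0}
    return {"action": "keep_lane", "target_speed": None}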
8. A method of model training, comprising:
acquiring a first training sample, wherein the first training sample comprises a first sample video and annotation data corresponding to the first sample video;
inputting the first sample video into a preset recognition model to obtain video features corresponding to the first sample video;
determining joint features of each joint part of a target object in the first sample video according to the video features, determining a space-time diagram according to the joint features corresponding to each joint part, and determining behavior features of the target object in the first sample video according to the space-time diagram, wherein the space-time diagram is used for representing the connection relation between each joint part of the target object in space and the motion situation of each joint part of the target object in time;
and determining a recognition result representing the behavior of the target object according to the behavior features, and training the recognition model by taking the minimized deviation between the recognition result and the annotation data corresponding to the first sample video as an optimization target.
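The optimization target of claim 8 corresponds to an ordinary supervised training step; cross-entropy is used below as one possible measure of the deviation between the recognition result and the annotation data, a choice the claim does not fix.

import torch
import torch.nn as nn

def train_step(recognition_model, optimizer, sample_video, label):
    # sample_video: (1, 3, T, H, W) first sample video; label: (1,) annotated behavior class
    optimizer.zero_grad()
    logits = recognition_model(sample_video)            # recognition result
    loss = nn.functional.cross_entropy(logits, label)   # deviation from the annotation data
    loss.backward()
    optimizer.step()
    return loss.item()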
9. The method of claim 8, wherein the recognition model comprises a video feature extraction layer to be trained and a prediction layer to be trained, the video feature extraction layer is used for extracting video features, and the prediction layer is used for determining target behavior;
training the recognition model by taking the minimized deviation between the recognition result and the annotation data corresponding to the first sample video as an optimization target, specifically comprising:
and performing joint training on the video feature extraction layer and the prediction layer by taking the minimized deviation between the recognition result and the annotation data corresponding to the first sample video as an optimization target.
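Joint training as in claim 9 means a single optimizer updates the parameters of both layers under the same loss; assuming the two layers are separate modules, it could be set up as follows.

import itertools
import torch

def make_joint_optimizer(video_feature_layer, prediction_layer, lr=1e-4):
    # a single optimizer covering both the video feature extraction layer and the prediction layer
    params = itertools.chain(video_feature_layer.parameters(), prediction_layer.parameters())
    return torch.optim.Adam(params, lr=lr)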
10. An apparatus for behavior recognition, comprising:
the apparatus comprises an acquisition module, an input module, a feature module, a determining module and a recognition module, wherein the acquisition module is used for acquiring a video to be recognized, and the video to be recognized comprises a motion video of a target object;
the input module is used for inputting the video to be recognized into a pre-trained recognition model to obtain the video features corresponding to the video to be recognized;
the feature module is used for determining the joint features corresponding to each joint part of the target object according to the video features;
the determining module is used for determining a space-time diagram according to the joint features corresponding to each joint part, wherein the space-time diagram is used for representing the connection relation between each joint part of the target object in space and the motion situation of each joint part of the target object in time;
and the recognition module is used for determining the behavior features of the target object according to the space-time diagram and performing behavior recognition on the target object according to the behavior features.
11. An apparatus for model training, comprising:
the acquisition module is used for acquiring a first training sample, wherein the first training sample comprises a first sample video and annotation data corresponding to the first sample video;
the input module is used for inputting the first sample video into a preset recognition model to obtain the video features corresponding to the first sample video;
the determining module is used for determining joint features of each joint part of a target object in the first sample video according to the video features, determining a space-time diagram according to the joint features corresponding to each joint part, and determining behavior features of the target object in the first sample video according to the space-time diagram, wherein the space-time diagram is used for representing the connection relation between each joint part of the target object in space and the motion situation of each joint part of the target object in time;
and the training module is used for determining a recognition result representing the behavior of the target object according to the behavior features, and training the recognition model by taking the minimized deviation between the recognition result and the annotation data corresponding to the first sample video as an optimization target.
12. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 7 or claims 8 to 9.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1 to 7 or claims 8 to 9.
CN202110969952.4A 2021-08-23 2021-08-23 Behavior recognition and model training method and device Withdrawn CN113673436A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110969952.4A CN113673436A (en) 2021-08-23 2021-08-23 Behavior recognition and model training method and device

Publications (1)

Publication Number Publication Date
CN113673436A true CN113673436A (en) 2021-11-19

Family

ID=78545384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110969952.4A Withdrawn CN113673436A (en) 2021-08-23 2021-08-23 Behavior recognition and model training method and device

Country Status (1)

Country Link
CN (1) CN113673436A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842549A (en) * 2022-03-17 2022-08-02 腾讯科技(深圳)有限公司 Method, device, equipment, storage medium and product for training motion recognition model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131908A (en) * 2019-06-24 2020-12-25 北京眼神智能科技有限公司 Action identification method and device based on double-flow network, storage medium and equipment
CN112303861A (en) * 2020-09-28 2021-02-02 山东师范大学 Air conditioner temperature adjusting method and system based on human body thermal adaptability behavior

Similar Documents

Publication Title
CN111079721B (en) Method and device for predicting track of obstacle
CN107358157B (en) Face living body detection method and device and electronic equipment
CN112766468B (en) Trajectory prediction method and device, storage medium and electronic equipment
CN108334892B (en) Vehicle type identification method, device and equipment based on convolutional neural network
CN108320296B (en) Method, device and equipment for detecting and tracking target object in video
CN112801229A (en) Training method and device for recognition model
CN112015847A (en) Obstacle trajectory prediction method and device, storage medium and electronic equipment
CN112465029A (en) Instance tracking method and device
CN111238523A (en) Method and device for predicting motion trail
CN111062372B (en) Method and device for predicting obstacle track
CN112327864A (en) Control method and control device of unmanned equipment
CN111126362A (en) Method and device for predicting obstacle track
CN112861831A (en) Target object identification method and device, storage medium and electronic equipment
CN111652286A (en) Object identification method, device and medium based on graph embedding
CN115600157A (en) Data processing method and device, storage medium and electronic equipment
CN112883871B (en) Model training and unmanned vehicle motion strategy determining method and device
CN112990099B (en) Method and device for detecting lane line
CN113673436A (en) Behavior recognition and model training method and device
CN111426299B (en) Method and device for ranging based on depth of field of target object
CN113988162A (en) Model training and image recognition method and device, storage medium and electronic equipment
CN114494381A (en) Model training and depth estimation method and device, storage medium and electronic equipment
CN116434787B (en) Voice emotion recognition method and device, storage medium and electronic equipment
CN112393723A (en) Positioning method, device, medium and unmanned device
CN112561961A (en) Instance tracking method and device
CN113344198B (en) Model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (Application publication date: 20211119)