CN113887501A - Behavior recognition method and device, storage medium and electronic equipment
- Publication number: CN113887501A (application CN202111229367.7A)
- Authority: CN (China)
- Prior art keywords: displacement (shift), network, fusion, bone, convolution
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06F18/214: Pattern recognition; design or setup of recognition systems or techniques; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06N3/045: Computing arrangements based on biological models; neural networks; combinations of networks
- G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
Abstract
The present disclosure relates to the field of computer technologies, and in particular, to a behavior recognition method, a behavior recognition apparatus, a storage medium, and an electronic device. The behavior recognition method comprises the following steps: acquiring a plurality of skeletal data features of a target object; performing branch processing on the plurality of skeletal data features using a branch shift graph convolution sub-network to obtain a plurality of feature maps; performing mainstream fusion on the plurality of feature maps through a fusion shift graph convolution sub-network to obtain a feature vector; and performing recognition from the feature vector using a fully connected layer to obtain the behavior information of the target object. The behavior recognition method provided by the disclosure addresses the large amount of computation required by prior-art skeleton-based behavior recognition.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a behavior recognition method, a behavior recognition apparatus, a storage medium, and an electronic device.
Background
In scenarios that require high-precision automatic behavior recognition and processing, such as elderly care, major health, and financial services, a behavior recognition task usually needs to be performed based on skeletons; the recognition results can be used for behavior analysis or prediction and have high practical value.
At present, mainstream behavior recognition algorithms are based on deep learning and neural network methods. Divided according to the modalities and methods used, they mainly include: (1) methods based on a 2D convolutional neural network plus a recurrent neural network, with representatives such as LRCN (Long-term Recurrent Convolutional Networks); (2) methods based on a 2D convolutional neural network plus optical flow, with representatives such as TSN (Temporal Segment Networks); (3) methods based on a 3D convolutional neural network, with representatives such as C3D (Convolutional 3D) and I3D (Inflated 3D, a two-stream 3D convolutional network); (4) methods based on skeletal points plus a neural network, with representatives such as ST-GCN (Spatial Temporal Graph Convolutional Networks).
In behavior recognition with a target object such as a human as the subject and center, the skeletal-point-plus-neural-network methods are more robust and perform relatively better in practice. However, models in this modality are often overly complex and over-parameterized, and the network usually comprises a multi-stream structure with a large number of model parameters, which results in a complex training process and high computational cost. There is therefore a need for a faster, stronger, and more efficient skeleton-based behavior recognition model to accomplish this task.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a behavior recognition method, a behavior recognition apparatus, a storage medium, and an electronic device, and aims to solve the prior-art problem of the large amount of computation required when performing skeleton-based behavior recognition.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the embodiments of the present disclosure, there is provided a behavior recognition method, including: acquiring a plurality of skeletal data features of a target object; performing branch processing on the plurality of skeletal data features using a branch shift graph convolution sub-network to obtain a plurality of feature maps; performing mainstream fusion on the plurality of feature maps through a fusion shift graph convolution sub-network to obtain a feature vector; and performing recognition from the feature vector using a fully connected layer to obtain the behavior information of the target object.
According to some embodiments of the present disclosure, based on the foregoing scheme, the acquiring a plurality of skeletal data features of the target object includes: acquiring image data collected by a camera at a preset frame rate within a preset time; detecting skeletal key points of the target object in each frame of the image data to obtain a skeletal key point sequence; and performing feature processing on the skeletal key point sequence to obtain the plurality of skeletal data features; wherein the plurality of skeletal data features comprise at least two of: skeletal key point features of a first coordinate system, skeletal key point features of a second coordinate system, velocity features, and bone features.
According to some embodiments of the present disclosure, based on the foregoing scheme, the performing branch processing on the plurality of skeletal data features using a branch shift graph convolution sub-network to obtain a plurality of feature maps includes: inputting each of the skeletal data features in parallel into its corresponding pre-trained branch shift graph convolution sub-network to obtain a feature map for each skeletal data feature; wherein the branch shift graph convolution sub-network is constructed by serially connecting a batch normalization layer, an initial shift graph convolution block, and at least one shift graph convolution block.
According to some embodiments of the present disclosure, based on the foregoing solution, the performing mainstream fusion on the plurality of feature maps through the fusion shift graph convolution sub-network to obtain a feature vector includes: inputting the plurality of feature maps together into the pre-trained fusion shift graph convolution sub-network to obtain the feature vector; wherein the fusion shift graph convolution sub-network is constructed by serially connecting at least one shift graph convolution block.
According to some embodiments of the present disclosure, based on the foregoing scheme, the shift graph convolution block includes a spatial shift operation module, a spatial pointwise convolution module, a temporal shift operation module, and a temporal pointwise convolution module.
According to some embodiments of the present disclosure, based on the foregoing, the method further includes performing model training on the branch shift graph convolution sub-network, the fusion shift graph convolution sub-network, and the fully connected layer, the model training including: acquiring a skeletal data training set and behavior labels of the skeletal data training set; constructing an original input graph according to the skeletal data training set, and performing edge deletion on the original input graph to obtain a target input graph; performing recognition using the branch shift graph convolution sub-network, the fusion shift graph convolution sub-network, and the fully connected layer based on the target input graph to obtain recognized behavior information; and comparing the recognized behavior information with the behavior labels to modify the model parameters of the branch shift graph convolution sub-network, the fusion shift graph convolution sub-network, and the fully connected layer.
According to some embodiments of the present disclosure, based on the foregoing scheme, the performing edge deletion on the original input graph to obtain a target input graph includes: constructing an original adjacency matrix according to the shift relationships in the original input graph, and calculating a target adjacency matrix corresponding to the target input graph based on the original adjacency matrix and a preset random edge retention rate; or constructing an attention-template adjacency matrix according to the shift relationships in the original input graph, calculating an edge retention probability for each edge of the original input graph based on the attention-template adjacency matrix and a preset adaptive edge deletion parameter, and acquiring the target adjacency matrix corresponding to the target input graph according to the edge retention probabilities.
According to a second aspect of the embodiments of the present disclosure, there is provided a behavior recognition apparatus including: an acquisition module for acquiring a plurality of skeletal data features of a target object within a preset time; a branch module for performing branch processing on the plurality of skeletal data features using a branch shift graph convolution sub-network to obtain a plurality of feature maps; a fusion module for performing mainstream fusion on the plurality of feature maps through a fusion shift graph convolution sub-network to obtain a feature vector; and a recognition module for performing recognition from the feature vector using a fully connected layer to obtain the behavior information of the target object.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a behavior recognition method as in the above embodiments.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the behavior recognition method as in the above embodiments.
Exemplary embodiments of the present disclosure may have some or all of the following benefits:
in some embodiments of the present disclosure, a plurality of skeletal data features of a target object within a preset time are first processed in separate branches using a branch shift graph convolution sub-network; the resulting feature maps are then fused into a mainstream using a fusion shift graph convolution sub-network; and finally the behavior information is recognized from the feature vector using a fully connected layer. On the one hand, the multi-stream structure that recognizes from multiple skeletal data features provides richer, more discriminative features and improves the accuracy of the model; on the other hand, the branch shift graph convolution sub-network and the fusion shift graph convolution sub-network are designed with the idea of early fusion: multiple input branches are designed, and a mainstream is applied after the feature branches are concatenated, so that the fusion sub-network participates in skeletal data fusion early. This not only retains rich input features, but also significantly suppresses the complexity and redundancy of the model, making the training process converge more easily.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 schematically illustrates a flow diagram of a behavior recognition method in an exemplary embodiment of the disclosure;
fig. 2(a) schematically illustrates a diagram of skeletal key points with N = 18 in an exemplary embodiment of the present disclosure;
fig. 2(b) schematically illustrates a diagram of skeletal key points with N = 25 in an exemplary embodiment of the present disclosure;
fig. 3(a) schematically illustrates a schematic diagram of a late-stage fusion architecture in an exemplary embodiment of the present disclosure;
FIG. 3(b) schematically illustrates a schematic diagram of an early fusion architecture in an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating an early fused multiple input branching architecture in an exemplary embodiment of the present disclosure;
FIG. 5 is a diagram schematically illustrating the structure of a shift graph convolution block in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow diagram of a model training method in an exemplary embodiment of the disclosure;
FIG. 7(a) schematically illustrates a generic non-local shift graph in an exemplary embodiment of the disclosure;
FIG. 7(b) schematically illustrates a generic non-local shift process in an exemplary embodiment of the disclosure;
FIG. 7(c) schematically illustrates a generic non-local shift feature map in an exemplary embodiment of the disclosure;
FIG. 8(a) schematically illustrates a non-local shift graph with edge deletion in an exemplary embodiment of the present disclosure;
FIG. 8(b) schematically illustrates a non-local shift process with edge deletion in an exemplary embodiment of the present disclosure;
FIG. 8(c) schematically illustrates a non-local shift feature map with edge deletion in an exemplary embodiment of the present disclosure;
FIG. 9(a) schematically illustrates a skeletal key point diagram at 0 s in an exemplary embodiment of the disclosure;
FIG. 9(b) schematically illustrates a skeletal key point diagram at 5 s in an exemplary embodiment of the disclosure;
FIG. 9(c) schematically illustrates a skeletal key point diagram at 10 s in an exemplary embodiment of the disclosure;
fig. 10 schematically illustrates a composition diagram of a behavior recognition apparatus in an exemplary embodiment of the present disclosure;
FIG. 11 schematically illustrates a schematic diagram of a computer-readable storage medium in an exemplary embodiment of the disclosure;
fig. 12 schematically shows a structural diagram of a computer system of an electronic device in an exemplary embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
In behavior recognition with a target object such as a human as the subject and center, the skeletal-point-plus-neural-network methods are more robust. Among all methods based on skeletal points plus a neural network, those based on skeletal points plus a spatio-temporal graph convolutional network, compared with those based on skeletal points plus a convolutional neural network or skeletal points plus a recurrent neural network, make better use of the non-Euclidean topological structure of the skeleton and learn discriminative and rich features from the skeleton sequence, as in the SOTA (state-of-the-art) skeletal-point spatio-temporal graph convolutional network models.
For the SOTA skeletal-point spatio-temporal graph convolutional network models, on the one hand, the network structure is usually very complex and over-parameterized; on the other hand, the network usually comprises a multi-stream structure with a large number of model parameters, which results in a complex training process and high computational cost.
Therefore, in order to reduce the huge redundant computation caused by GCN convolution operations and multi-stream late fusion, the technical scheme of the present application is proposed: a multi-input-branch shift graph convolutional network structure with early fusion (Multiple Input Branches Shift graph convolution with DropEdge, MIBSD-GCN).
Implementation details of the technical solution of the embodiments of the present disclosure are set forth in detail below.
Fig. 1 schematically illustrates a flow chart of a behavior recognition method in an exemplary embodiment of the present disclosure. As shown in fig. 1, the behavior recognition method includes steps S1 to S4:
step S1, acquiring a plurality of skeletal data features of the target object within a preset time;
step S2, performing branch processing on the plurality of skeletal data features using a branch shift graph convolution sub-network to obtain a plurality of feature maps;
step S3, performing mainstream fusion on the plurality of feature maps through a fusion shift graph convolution sub-network to obtain a feature vector;
and step S4, performing recognition from the feature vector using a fully connected layer to obtain the behavior information of the target object.
In some embodiments of the present disclosure, a plurality of skeletal data features of a target object within a preset time are first processed in separate branches using a branch shift graph convolution sub-network; the resulting feature maps are then fused into a mainstream using a fusion shift graph convolution sub-network; and finally the behavior information is recognized from the feature vector using a fully connected layer. On the one hand, the multi-stream structure that recognizes from multiple skeletal data features provides richer, more discriminative features and improves the accuracy of the model; on the other hand, the branch shift graph convolution sub-network and the fusion shift graph convolution sub-network are designed with the idea of early fusion: multiple input branches are designed, and a mainstream is applied after the feature branches are concatenated, so that the fusion sub-network participates in skeletal data fusion early. This not only retains rich input features, but also significantly suppresses the complexity and redundancy of the model, making the training process converge more easily.
Hereinafter, each step of the behavior recognition method in the present exemplary embodiment will be described in more detail with reference to the drawings and examples.
In step S1, a plurality of skeletal data features of the target object within a preset time are obtained.
In one embodiment of the present disclosure, the present application performs behavior recognition based on skeletal points: the starting point is the skeleton itself, and skeletal point extraction and pose estimation are not involved. Therefore, it is necessary to acquire the skeletal data features of the target object and to use dynamic sequence data over a period of time as the basis for analysis. The preset time can be set as required, for example 5 seconds or 3 seconds.
For skeletal fusion analysis, multiple skeletal data features are required, comprising at least two of: skeletal key point features of a first coordinate system, skeletal key point features of a second coordinate system, velocity features, and bone features.
It should be noted that, at each moment within the preset time, the skeletal key points of the target object have corresponding positions; the key point features of the first coordinate system, the key point features of the second coordinate system, the velocity features, and the bone features therefore all carry sequence data over different moments.
The skeletal key point features of the first coordinate system may be those of a Cartesian coordinate system (Cartesian Coordinates), comprising the absolute and relative coordinates of each skeletal point in the Cartesian coordinate system at each moment.
The skeletal key point features of the second coordinate system may be those of a spherical coordinate system (Spherical Coordinates), comprising the absolute and relative coordinates of each skeletal point in the spherical coordinate system at each moment.
The velocity features (Velocity) are the first-order and second-order differences of the skeletal motion between adjacent moments.
The bone features (Bone) are the length and angle information of each bone.
Further, in one embodiment of the present disclosure, the skeletal key point sequence may first be obtained in step S1 and then converted into the plurality of skeletal data features. Acquiring the plurality of skeletal data features of the target object comprises:
step S11, acquiring image data collected by a camera at a preset frame rate within a preset time;
step S12, detecting skeletal key points of the target object in each frame of the image data to obtain the skeletal key point sequence;
step S13, performing feature processing on the skeletal key point sequence to obtain the plurality of skeletal data features.
Steps S11 and S12 extract the skeletal key point sequence using a camera. During collection, the image data within the preset time can be captured directly, or a segment of a certain duration can be cut out of all the image data.
The acquired skeletal key point sequence may consist of 2D coordinate data of the skeletal points, or of 3D coordinate data, depending on the environment and space in which the target object is located.
The extraction process differs slightly according to the capabilities of the camera. For example, an ordinary optical camera may be used to collect image data, after which a classical human pose estimation algorithm, such as OpenPose or another deep learning network, detects the skeletal key points. Alternatively, a depth camera with a dedicated sensor, such as the common Microsoft Kinect camera, can directly acquire image data with depth information, from which the skeletal key points are detected.
Specifically, taking the detection of human key points as an example, fig. 2 schematically illustrates skeletal key points in an exemplary embodiment of the disclosure. Fig. 2(a) shows skeletal key points with N = 18, that is, a human body composed of 18 skeletal key points; fig. 2(b) shows skeletal key points with N = 25, that is, a human body composed of 25 skeletal key points. The detected skeletal key point sequence contains the related information of these key points.
In actual operation, a depth camera can acquire real-time image data at a specified frame rate (FPS). Taking a segment of T frames, no more than p human skeleton sequences are extracted from each segment, each human body comprising N skeletal key points. Let v_ti denote the i-th skeletal key point of the t-th frame in the segment; a human skeleton sequence in the video segment can then be represented as V = {v_ti | t = 1, ..., T; i = 1, ..., N}. Commonly N = 18 or N = 25, defined according to different skeletal key point conventions, as shown in fig. 2(a) and fig. 2(b).
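For illustration, such a skeleton sequence is conveniently stored as a dense array indexed by coordinate channel, frame, key point, and person, matching the definitions above. A minimal sketch; the array layout and the sizes chosen are an assumption for illustration, not prescribed by the patent:

```python
import numpy as np

# illustrative sizes: T frames, N key points, p persons, C coordinate channels
T, N, p, C = 300, 25, 2, 3
V = np.zeros((C, T, N, p), dtype=np.float32)   # skeleton sequence tensor

# v_ti for person 0: the i-th skeletal key point of the t-th frame
t, i = 10, 4
v_ti = V[:, t, i, 0]                           # its (x, y, z) coordinates
```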
In step S13, in order to complement and enhance different feature expressions, the plurality of skeletal data features are obtained by performing feature processing on the skeletal key point sequence. For example, the obtained skeletal key point sequence can be converted into 4 branches and 8 data formats: skeletal joint point features of the Cartesian coordinate system (absolute and relative coordinates), skeletal joint point features of the spherical coordinate system (absolute and relative coordinates), velocity features (first-order and second-order differences of motion between adjacent moments), and bone features (bone length and angle).
Converting the skeletal key point sequence into skeletal key point features of the Cartesian coordinate system, velocity features, and bone features is common in skeletal-point-based behavior recognition in this field; the conversion can be completed with the prior art and is not repeated here. A brief sketch of the velocity and bone branches is nevertheless given below for illustration.
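A sketch of the feature processing of step S13 for the velocity and bone branches, under the definitions above (first- and second-order differences between adjacent frames; bone vectors between connected joints). The axis layout, function names, and the direction-cosine form of the angle are illustrative assumptions:

```python
import numpy as np

def velocity_features(V):
    """First- and second-order differences of motion between adjacent frames.
    V: (C, T, N) array of key point coordinates."""
    v1 = np.diff(V, n=1, axis=1)     # first-order difference, shape (C, T-1, N)
    v2 = np.diff(V, n=2, axis=1)     # second-order difference, shape (C, T-2, N)
    return v1, v2

def bone_features(V, bone_pairs):
    """Bone vectors between connected joints; length and angle follow from them.
    bone_pairs: list of (child, parent) key point indices."""
    bones = np.stack([V[:, :, i] - V[:, :, j] for i, j in bone_pairs], axis=2)
    length = np.linalg.norm(bones, axis=0)         # bone length per frame
    angle = bones / (length[None] + 1e-8)          # unit direction (cosines)
    return length, angle
```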
It should be noted that the data representation of the spherical coordinate branch is a characteristic that distinguishes the present disclosure from other schemes; its specific conversion can be derived from the skeletal key point features of the Cartesian coordinate system, as described in detail below.
The spherical-coordinate skeletal point features display more clearly, from another coordinate system, the course of specific skeletal motions, such as waving, turning around, and falling in human skeleton recognition, and can thus further improve the accuracy of action recognition.
Suppose a human skeletal key point v_ti has three-dimensional coordinates (x, y, z). The conversion between the Cartesian coordinate system and the corresponding spherical coordinate system is shown in equations (1) to (3):

r = √(x² + y² + z²)    (1)

θ = arctan(√(x² + y²) / z)    (2)

φ = arctan(y / x)    (3)

where r is the distance from the origin o to the key point v_ti; θ is the angle between the z-axis and the line connecting the origin o to v_ti, expressed as the polar angle; and φ is the angle between the x-axis and the projection onto the xy-plane of the line connecting o to v_ti, expressed as the azimuth.
According to these relations, coordinates can be converted between the Cartesian coordinate system and the spherical coordinate system. In the calculation, it should be ensured that θ lies within the range (-π/2, π/2) and φ within the range (-π, π), which in practice requires the quadrant-aware form of the arctangent for φ.
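A minimal numerical sketch of equations (1) to (3), assuming the convention reconstructed above (θ obtained with arctan so it lies in (-π/2, π/2), φ computed with the quadrant-aware arctan2 so it lies in (-π, π]); function and variable names are illustrative:

```python
import numpy as np

def cartesian_to_spherical(x, y, z, eps=1e-8):
    """Convert skeletal key point coordinates (x, y, z) to (r, theta, phi)."""
    r = np.sqrt(x * x + y * y + z * z)        # equation (1)
    rho = np.sqrt(x * x + y * y)
    theta = np.arctan(rho / (z + eps))        # equation (2); eps guards z == 0
    phi = np.arctan2(y, x)                    # equation (3), quadrant-aware
    return r, theta, phi
```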
In one embodiment of the present disclosure, an ablation experiment can be used to explain the technical effect of introducing the spherical-coordinate skeletal point features.
Comparative ablation experiments were designed using the NTU RGB+D 60 X-View dataset. The dataset contains 60 action classes and 56880 samples in total, of which 40 classes are daily behaviors, 9 are health-related actions, and 11 are two-person interactions. These actions are performed by 40 subjects aged 10 to 35. The dataset was acquired with the Microsoft Kinect v2 sensor using three cameras at different angles; the collected data include depth information, 3D skeletal information, RGB frames, and infrared sequences.
In the experimental design, the model accuracy of 3-stream and 4-stream input is examined for networks composed of different unit blocks. 3-stream means behavior recognition using three skeletal data features: skeletal key point features of the Cartesian coordinate system, velocity features, and bone features; 4-stream means behavior recognition using four skeletal data features: skeletal key point features of the Cartesian coordinate system, skeletal key point features of the spherical coordinate system, velocity features, and bone features.
ResGCN denotes the baseline graph convolutional network model constructed from basic ResGCN unit blocks, an existing implementation of a graph convolutional network model; Shift-GCN denotes the network model proposed by the present disclosure after integrating the early-fusion framework (MIB) with Shift-GCN blocks. Here only the technical effect of introducing the spherical-coordinate skeletal point features is compared; the models themselves are described in detail later.
Table 1. Accuracy comparison of different multi-input branches using ResGCN

| Input branches | Accuracy |
| --- | --- |
| 3-stream | 96.0% |
| 4-stream | 96.2% |
Table 1 shows the comparison of model accuracy before and after introducing the spherical-coordinate skeletal point features when ResGCN is used as the unit block of the network. Referring to Table 1, the accuracy of the model with 3-stream input branches is 96.0%, while the accuracy with 4-stream input branches, that is, after introducing the spherical-coordinate skeletal point features, is 96.2%, an improvement of 0.2%.
Table 2. Accuracy comparison of different multi-input branches using Shift-GCN

| Input branches | Accuracy |
| --- | --- |
| 3-stream | 95.8% |
| 4-stream | 96.0% |
Table 2 shows the comparison of model accuracy before and after introducing the spherical-coordinate skeletal point features when Shift-GCN is used as the unit block of the network. Referring to Table 2, the accuracy of the model with 3-stream input branches is 95.8%, while the accuracy with 4-stream input branches is 96.0%; the accuracy improves after the spherical-coordinate skeletal point features are introduced.
Therefore, according to the results in Tables 1 and 2, after the spherical-coordinate features are added, the accuracy of the model improves by 0.2% whether the network model is constructed with ResGCN or Shift-GCN as the unit block, verifying the effectiveness of the spherical-coordinate features for improving model accuracy.
Based on this, the multi-stream structure serves as the basis of action recognition and provides richer, more discriminative features, thereby improving the accuracy of the model. In particular, the method introduces the spherical-coordinate skeletal point features as one of the input branches, converting the Cartesian coordinates (x, y, z) of the skeleton into the spherical coordinates (r, θ, φ) to obtain a polar representation of the skeleton; the information and features on spatial position and motion change are thus enhanced and supplemented, and the accuracy of action recognition can be effectively improved.
Although score-level (late) fusion of the multi-stream structure brings excellent model accuracy, it greatly increases the size and computational cost of the model, and many parameters in the multi-stream model are redundant; reducing the amount of computation in the subsequent skeletal fusion analysis therefore remains a problem to be solved.
At present, most multi-stream structures adopt late fusion for action recognition, but late fusion generally involves heavy computation and requires training to be repeated many times. The present method therefore designs an early-fusion multi-input branch architecture model, fusing the input branches in the early stage of the model and then extracting the overall recognition features with a mainstream.
Fig. 3 schematically illustrates the two fusion architectures in an exemplary embodiment of the present disclosure: fig. 3(a) shows a late-fusion architecture and fig. 3(b) an early-fusion architecture. Data 1 and data 2 are the multi-stream data input to the neural network, each Block is a computing block of the network, and the output finally passes through a classifier. Of course, fig. 3 is only an exemplary structure; in practical applications the data may have more streams, and the unit blocks composing the network model may also differ.
In fig. 3(a), data 1 and data 2 are each processed by three blocks and fused only at the classifier; in fig. 3(b), data 1 and data 2 are merged after one block, and the subsequent blocks 2 and 3 compute only on the merged data, so one pass through blocks 2 and 3 is saved.
Referring to fig. 3, compared with the late-fusion framework, the early-fusion framework fuses the input branches in the early stage of the model and applies a mainstream after the feature branches are concatenated. This not only retains rich input features but also markedly suppresses the complexity and redundancy of the model, makes the training process converge more easily, and greatly reduces the amount of computation.
Steps S2 to S4 use a pre-trained early-fusion multi-input branch architecture model to recognize actions from the plurality of skeletal data features. Fig. 4 schematically illustrates the early-fusion multi-input branch architecture in an exemplary embodiment of the present disclosure; referring to fig. 4, the model comprises three major parts: a branch shift graph convolution sub-network, a fusion shift graph convolution sub-network, and a fully connected layer.
In step S2, the branch shift graph convolution sub-network performs branch processing on the plurality of skeletal data features to obtain a plurality of feature maps.
In an embodiment of the present disclosure, performing branch processing on the plurality of skeletal data features using the branch shift graph convolution sub-network to obtain the plurality of feature maps includes: inputting each skeletal data feature in parallel into its corresponding pre-trained branch to obtain a feature map for each skeletal data feature.
Specifically, the branch shift graph convolution sub-network has a plurality of branch inputs, one for each skeletal data feature. Each branch can be constructed by serially connecting a batch normalization layer, an initial shift graph convolution block, and at least one shift graph convolution block.
Referring to fig. 4, the branch shift graph convolution sub-network has four branch inputs: skeletal key point features of the Cartesian coordinate system (Cartesian Coordinates), skeletal key point features of the spherical coordinate system (Spherical Coordinates), velocity features (Velocity), and bone features (Bone). Each branch is formed by stacking, in order, a batch normalization layer (BatchNorm), an initial block implemented with shift graph convolution (Initial Shift-GCN), and two shift graph convolution blocks (Shift-GCN). Therefore, in step S2, the skeletal data features obtained in step S1 are input in parallel to the corresponding branches, the resulting feature maps are fused by a concatenation operation, and the pre-trained branch sub-network yields a feature map for each skeletal data feature; a structural sketch is given below.
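A PyTorch sketch of the early-fusion structure of fig. 4, under stated assumptions: ShiftGCNBlock is the hypothetical module sketched after fig. 5 below (the initial block would additionally omit the residual connection, which is glossed over here), the channel sizes are illustrative, and the part-wise attention of the full model is omitted:

```python
import torch
import torch.nn as nn

class MIBShiftGCN(nn.Module):
    """Sketch of the early-fusion multi-input branch architecture (fig. 4)."""
    def __init__(self, in_channels=(6, 6, 6, 6), branch_out=32, num_classes=60):
        super().__init__()
        # one branch per skeletal data feature:
        # BatchNorm -> initial shift block -> two shift graph convolution blocks
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(c),
                ShiftGCNBlock(c, branch_out),
                ShiftGCNBlock(branch_out, branch_out),
                ShiftGCNBlock(branch_out, branch_out),
            ) for c in in_channels
        ])
        fused = branch_out * len(in_channels)
        # mainstream: six shift graph convolution blocks after concatenation
        self.mainstream = nn.Sequential(*[ShiftGCNBlock(fused, fused)
                                          for _ in range(6)])
        self.fc = nn.Linear(fused, num_classes)

    def forward(self, inputs):            # inputs: list of (B, C, T, N) tensors
        feats = [branch(x) for branch, x in zip(self.branches, inputs)]
        x = torch.cat(feats, dim=1)       # early fusion: concatenate branches
        x = self.mainstream(x)
        x = x.mean(dim=(2, 3))            # global average pooling over T and N
        return self.fc(x)                 # class logits, i.e. behavior information
```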
It should be noted that the network structure shown in fig. 4 is only exemplary. In practical applications, different skeletal data features may be chosen for fusion analysis as required, and the number of branches may also vary, but it should not be fewer than two.
In the branch shift graph convolution sub-network, the batch normalization layer (BatchNorm) normalizes each input batch of data.
The core idea of the shift graph convolution block (Shift-GCN) is to combine shift operations, which exchange information between nodes on the feature map, with 1 × 1 pointwise convolutions, achieving an effect similar to GCN convolution while reducing the computation of the GCN block. For spatial modeling, fully connected spatial shift and pointwise convolution operations are applied to the nodes of each frame t in the skeleton sequence; for temporal feature extraction, contextual features are learned using temporal shift operations with adaptive parameters together with pointwise convolution.
The initial shift graph convolution block (Initial Shift-GCN) differs from the shift graph convolution block only in that no residual connection structure is applied.
In one embodiment of the present disclosure, the shift graph convolution block includes a spatial shift operation module, a spatial pointwise convolution module, a temporal shift operation module, and a temporal pointwise convolution module.
Fig. 5 schematically illustrates the structure of a shift graph convolution block in an exemplary embodiment of the present disclosure. Referring to fig. 5, one shift graph convolution block (Shift-GCN) is formed by serially connecting a spatial shift module (S-Shift), a spatial pointwise convolution module (S-Pointwise), another spatial shift module (S-Shift), a temporal shift module (T-Shift), a temporal pointwise convolution module (T-Pointwise), and another temporal shift module (T-Shift). S-Shift, S-Pointwise, and S-Shift perform the shift computation over space, while T-Shift, T-Pointwise, and T-Shift perform the shift computation over time. A sketch of this block follows.
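A PyTorch sketch of one such block under the structure of fig. 5. The non-local spatial shift is written out explicitly; the adaptive temporal shift of the full model is replaced here by a plain pointwise convolution, and the residual connection is an assumption based on the description above:

```python
import torch
import torch.nn as nn

def spatial_shift(x):
    # non-local spatial shift: channel i of each frame is rotated by i positions
    # over the N key points (i modulo N), exchanging information between nodes
    B, C, T, N = x.shape
    out = torch.empty_like(x)
    for i in range(C):
        out[:, i] = torch.roll(x[:, i], shifts=i % N, dims=-1)
    return out

class ShiftGCNBlock(nn.Module):
    """Sketch of one Shift-GCN block: S-Shift -> S-Pointwise -> S-Shift ->
    T-Shift -> T-Pointwise -> T-Shift, with a residual connection."""
    def __init__(self, in_c, out_c):
        super().__init__()
        self.s_pointwise = nn.Conv2d(in_c, out_c, kernel_size=1)   # 1x1 conv
        self.t_pointwise = nn.Conv2d(out_c, out_c, kernel_size=1)
        self.residual = (nn.Identity() if in_c == out_c
                         else nn.Conv2d(in_c, out_c, kernel_size=1))
        self.relu = nn.ReLU()

    def forward(self, x):                 # x: (B, C, T, N)
        y = spatial_shift(x)
        y = self.s_pointwise(y)
        y = spatial_shift(y)
        # the adaptive temporal shift (T-Shift) of the full model would act
        # along T here; a plain pointwise convolution stands in for it
        y = self.t_pointwise(y)
        return self.relu(y + self.residual(x))
```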
For example, if the input and output channels are 256, the channel reduction rate r is 4, and the temporal window size L is 8, a basic block contains 256 × 256 × 8 = 524288 parameters in total, while the Shift-GCN block contains only 256 × 64 + 64 × 64 × 8 + 64 × 256 = 65536 parameters; the basic block thus has 8 times as many parameters as the Shift-GCN block.
Therefore, the combination of temporal and spatial shift operations with lightweight pointwise convolution makes it possible to use the shift graph convolution block, with its non-local spatial shift operation and adaptive temporal shift operation, as the unit block of the network structure, greatly reducing the amount of computation.
In step S3, the plurality of feature maps are fused into the mainstream through the fusion shift graph convolution sub-network to obtain the feature vector.
In an embodiment of the present disclosure, performing mainstream fusion on the plurality of feature maps through the fusion shift graph convolution sub-network to obtain the feature vector includes: inputting the plurality of feature maps together into the pre-trained fusion shift graph convolution sub-network to obtain the feature vector.
Specifically, the fusion shift graph convolution sub-network is a pre-trained neural network whose input has only one entry, so the plurality of feature maps must be input together to output one feature vector. The fusion sub-network comprises at least one shift graph convolution block, identical in structure to the blocks in the branch sub-network, and is not described again here.
Referring to fig. 4, the fusion shift graph convolution sub-network comprises six shift graph convolution blocks (Shift-GCN). Therefore, in step S3, the feature maps corresponding to the four skeletal data features (Cartesian key point features, spherical key point features, velocity features, and bone features) are input into the mainstream formed by the six Shift-GCN blocks, and the output feature map of the mainstream is finally globally average-pooled into a feature vector representing the multiple input skeletal data features.
Meanwhile, a part-wise attention mechanism (Part-wise Attention, PartAtt) is introduced into the mainstream to emphasize the importance of different bones over the whole action sequence. The attention mechanism is a data processing method in machine learning, widely applied to natural language processing, image recognition, speech recognition, and other machine learning tasks, and is not detailed here.
In step S4, the behavior information of the target object is obtained by performing recognition from the feature vector using the fully connected layer.
In one embodiment of the present disclosure, the fully connected layer (FC) acts as the classifier of the whole convolutional neural network, mapping the feature vector to the sample label space to determine the final action class, that is, the behavior information.
In the following, ablation experiments likewise illustrate the technical effect of the early-fusion multi-input branch architecture model provided by the present disclosure. Comparative ablation experiments were designed using NTU RGB+D 60 X-View.
3-stream denotes behavior recognition using three features: skeletal key point features of the Cartesian coordinate system, velocity features, and bone features; 4-stream denotes behavior recognition using four features: skeletal key point features of the Cartesian coordinate system, skeletal key point features of the spherical coordinate system, velocity features, and bone features.
ResGCN is the baseline graph convolutional network model constructed from basic ResGCN unit blocks; Shift-GCN is the network model proposed by the present disclosure, which integrates the early-fusion framework (MIB) with Shift-GCN blocks.
Table 3. Comparison of computation and accuracy of different models with 3 input features

| Model | FLOPs | Accuracy |
| --- | --- | --- |
| ResGCN | 18.52G | 96.0% |
| Shift-GCN (proposed) | 2.89G | 95.8% |
Table 3 shows the comparison of computation and accuracy between the ResGCN and Shift-GCN models using 3 input features. Referring to Table 3, although the accuracy of the Shift-GCN model drops from 96.0% to 95.8%, the computation drops sharply from 18.52 GFLOPs to 2.89 GFLOPs.
Here FLOPs refers to floating-point operations, a measure of model computation; 1 GFLOPs denotes 10^9 floating-point operations. The related unit FLOPS, floating-point operations per second, is often used as a GPU performance parameter.
Table 4. Comparison of computation and accuracy of different models with 4 input features

| Model | FLOPs | Accuracy |
| --- | --- | --- |
| ResGCN | 20.32G | 96.2% |
| Shift-GCN (proposed) | 2.96G | 96.0% |
Table 4 shows the comparison of computation and accuracy between the ResGCN and Shift-GCN models using 4 input features. Referring to Table 4, the accuracy of the Shift-GCN model drops from 96.2% to 96.0%, but the computation drops from 20.32 GFLOPs to 2.96 GFLOPs.
Therefore, combining the results shown in Tables 3 and 4, integrating the early-fusion framework (MIB) with Shift-GCN blocks costs 0.2 points of accuracy but reduces the amount of computation by more than a factor of six.
Based on this, at the level of the overall architecture, in order to take both accuracy and complexity into account, the method constructs an early-fusion multi-input branch architecture model (MIBSD-GCN), which fuses the input branches in the early stage of the model and applies a mainstream after the feature branches are concatenated. The architecture not only preserves rich input features but also significantly suppresses the complexity and redundancy of the model, making the training process converge more easily.
At the level of unit-module selection, in order to balance precision and complexity and optimize the whole model, the shift graph convolution block with non-local spatial shift operation and adaptive temporal shift operation is chosen as the unit block of the network structure, replacing the heavy regular graph convolution. Owing to the combination of shift operations with lightweight pointwise convolution, the computational complexity of shift graph convolution is several times lower than that of a conventional graph convolutional network.
Therefore, by improving both the overall framework and the unit module, the high-performance early-fusion framework (MIB) and the Shift-GCN unit block are integrated to obtain their respective advantages: a lightweight shift graph convolution network is embedded in the early-fusion multi-branch structure, redundant computation is greatly reduced from the two angles of network structure and unit module, accuracy and complexity are both taken into account, and a model excellent in both accuracy and computation is obtained.
In one embodiment of the present disclosure, the method further comprises performing model training on the early-fusion multi-input branch architecture model to obtain the branch shift graph convolution sub-network, the fusion shift graph convolution sub-network, and the fully connected layer.
Fig. 6 schematically illustrates a flowchart of a model training method in an exemplary embodiment of the present disclosure, and referring to fig. 6, the model training method includes:
step S61, acquiring a skeletal data training set and behavior labels of the skeletal data training set;
step S62, constructing an original input graph from the skeletal data training set, and performing edge deletion on the original input graph to obtain a target input graph;
step S63, performing recognition with the branch shift graph convolution sub-network, the fusion shift graph convolution sub-network, and the fully connected layer based on the target input graph to obtain recognized behavior information;
step S64, comparing the recognized behavior information with the behavior labels so as to modify the model parameters of the branch shift graph convolution sub-network, the fusion shift graph convolution sub-network, and the fully connected layer.
Specifically, the training process is essentially the same as that of an ordinary neural network: the training set carries real behavior labels, and the recognized behavior information is compared with these labels to learn the parameters of the model, finally yielding the trained model.
In contrast to the prior art, in step S62 the input graph is not used directly for behavior recognition; instead, an edge deletion (DropEdge) mechanism is introduced into the training process.
In non-local shift graph convolution, all nodes in the input graph are connected to each other, and the connection strength between different nodes is the same. As the number of connections between nodes increases, the connection relationships become more complex, and over-fitting and over-smoothing of the computed results become more severe. The edge deletion mechanism effectively generates different randomly deformed copies of the original graph, increasing the randomness and diversity of the input data and better preventing over-fitting. In addition, edge deletion can be viewed in graph convolution as a reducer of message passing: deleting certain edges makes the node connections sparser, thereby avoiding excessive smoothing to some extent.
The present disclosure therefore proposes an edge deletion operation in shift graph convolution.
In the non-local displacement graph operation, given the spatial bone feature map, the moving distance of the i-th channel is i modulo the number N of bone key points, and the moved-out channel fills the corresponding moved-in channel. FIG. 7 schematically illustrates an ordinary non-local displacement graph operation in an exemplary embodiment of the disclosure: fig. 7(a) shows the displacement graph of node 1 after non-local displacement, fig. 7(b) shows the non-local displacement process of node 1, and fig. 7(c) shows the displacement feature map after non-local displacement.
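A small NumPy sketch of this plain non-local shift follows; the function name and the (N, C) layout are assumptions for illustration.

```python
import numpy as np

def non_local_spatial_shift(feat):
    """feat: (N, C) array of N bone key points by C channels.
    Channel i is shifted along the node axis by i mod N, so the
    moved-out channel of one node fills the move-in channel of another."""
    n, c = feat.shape
    shifted = np.empty_like(feat)
    for i in range(c):
        # node v receives the value of node (v + i) mod N on channel i
        shifted[:, i] = np.roll(feat[:, i], shift=-(i % n))
    return shifted
```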
Considering that the non-local displacement graph operation is directed, the entire displacement graph is treated as a directed graph and all edges as directed edges. To delete an edge from one node to another, it suffices to stop the shift operation from the move-out node to the move-in node and keep the move-in channel at its original value on the feature map, instead of filling it with the value of the move-out channel.
Fig. 8 schematically illustrates a non-local displacement graph operation with edge deletion in an exemplary embodiment of the present disclosure. Referring to fig. 8, the edges from node 3 to node 1 and from node 6 to node 1 are deleted, so the move-in channels corresponding to those edges keep their original values on the feature map and no shift operation is performed. Fig. 8(a) shows the displacement graph under non-local displacement with edge deletion, fig. 8(b) shows the non-local displacement process with edge deletion, in which the deleted edges are marked with the "x" symbol, and fig. 8(c) shows the resulting displacement feature map.
As with ordinary non-local displacement graph convolution, the displacement graph operation is connected in series with a point-wise convolution to form a displacement graph convolution module with an edge deletion mechanism. Notably, the edge deletion mechanism is active only during training; the edge deletion strategy is not used in validation or testing.
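For concreteness, a sketch of such a module is given below; the shift_op argument stands for a non-local shift operation supporting edge deletion, and all names are assumptions for illustration, not the disclosure's exact implementation.

```python
import torch.nn as nn

class ShiftDropGCBlock(nn.Module):
    """Displacement graph operation connected in series with a point-wise
    convolution; the edge deletion mask is drawn only while the module is
    in training mode, matching the behaviour described above."""

    def __init__(self, shift_op, in_channels, out_channels):
        super().__init__()
        self.shift_op = shift_op  # e.g. a non-local shift supporting DropEdge
        self.point_conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                               # x: (B, C, T, V)
        x = self.shift_op(x, drop_edges=self.training)  # no deletion at eval/test time
        return self.relu(self.bn(self.point_conv(x)))
```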
In one embodiment of the present disclosure, when performing edge deletion in step S62, it is also necessary to determine which edges to delete. There are two possible ways: a random edge deletion policy and an adaptive attention-guided edge deletion policy.
Specifically, for a first random edge deletion policy, the performing edge deletion on the original input graph to obtain a target input graph includes:
step S6211, constructing an original adjacency matrix according to the displacement relation in the original input graph;
step S6212, a target adjacency matrix corresponding to the target input graph is calculated based on the original adjacency matrix and a preset random edge retention rate.
Further, in step S6211, as in ordinary graph convolution, an adjacency matrix A_shift is first used to represent the displacement relationships in the directed graph. A_shift is a matrix whose elements {a_ij | i = 1, ..., N, j = 1, ..., N} represent the displacement relationship between the respective nodes, where v_i denotes the move-in node and v_j the move-out node. If the connection from v_j to v_i is kept, denoted v_j → v_i, the elements of the matrix can be expressed as follows:
a_ij = 1 if the connection v_j → v_i is kept, and a_ij = 0 otherwise.
For the non-local displacement graph, every element of A_shift is 1.
In step S6212, when the random edge deletion strategy is applied, the non-local displacement graph randomly discards a certain percentage of edges at each stage of training the model. Formally, N×N×(1−p) non-zero elements of the adjacency matrix are randomly forced to zero, where N×N is the total number of edges, p is the random edge retention rate (the preset probability that an edge is retained), and 1−p is the random edge deletion rate.
From this, A_drop can be calculated: A_drop is a sparse matrix expanded from a randomly sized subset of the original edges, i.e. it represents the deleted edges. Letting A_shift-drop denote the resulting adjacency matrix, the relationship can be expressed as follows:
A_shift-drop = A_shift − A_drop (5)
A_shift-drop controls whether the displacement operation is performed between corresponding nodes of the non-local displacement graph convolution. If an element of A_shift-drop is 1, the normal displacement operation v_j → v_i is executed; otherwise no displacement is performed, and the displacement channel corresponding to v_j → v_i on the feature map keeps its original value.
In particular, rather than using the same A_shift-drop within a training batch, we prefer to compute it independently layer by layer, denoted A^l_shift-drop. This layer-wise edge deletion brings more randomness and distortion to the original data, and higher performance, than batch-wise edge deletion.
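Under this notation, a layer-wise random DropEdge sketch might look as follows; a fresh Bernoulli mask is drawn per layer and per forward pass, and the masked shift keeps a move-in channel's original value whenever its edge is dropped. The function names and the (N, C) layout are assumptions for illustration.

```python
import torch

def layer_wise_dropedge(a_shift, p):
    """Sample A^l_shift-drop: each edge of the (N, N) float adjacency
    a_shift is kept independently with probability p (the random edge
    retention rate), realizing A_shift-drop = A_shift - A_drop of (5)."""
    keep = torch.bernoulli(torch.full_like(a_shift, p))
    return a_shift * keep

def masked_non_local_shift(feat, a_shift_drop):
    """feat: (N, C). On channel ch, the move-out node of move-in node i is
    (i + ch) mod N; the shift is performed only where the edge survives,
    otherwise the move-in channel keeps its original value."""
    n, c = feat.shape
    out = feat.clone()
    for ch in range(c):
        for i in range(n):
            j = (i + ch) % n              # move-out node v_j for move-in node v_i
            if a_shift_drop[i, j] > 0:    # edge v_j -> v_i retained
                out[i, ch] = feat[j, ch]
    return out
```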
For a second adaptive attention-directed edge deletion policy, the performing edge deletion on the original input graph to obtain a target input graph includes:
step S6221, constructing an attention template adjacency matrix according to the displacement relation in the original input graph;
step S6222, calculating the edge retention probability of each edge of the original input graph based on the attention template adjacency matrix and a preset adaptive edge deletion parameter;
step S6223, obtaining a target adjacency matrix corresponding to the target input graph according to the edge retention probabilities.
In particular, in non-local displacement graph convolution the strength of the connection between different nodes is the same, and the probability of being discarded is the same for all edges. However, in actual motion recognition the importance of each edge differs. Therefore, when performing edge deletion it is desirable to retain edges that contribute more to motion recognition with higher probability, and to discard edges that contribute less with higher probability.
Based on this idea, an adaptive attention-guided edge deletion mechanism is provided so that important edges have a higher sampling probability. The retention probability is calculated as follows:
P^l = d · tanh(k · M_A^l) + (1 − d) (6)
where P^l is the edge-retention probability matrix, M_A^l is the attention template adjacency matrix, d is a preset adaptive edge deletion parameter (attention-guided drop parameter) controlling the probability interval of edge deletion, and k is a scale parameter.
P^l and M_A^l are matrices of the same size as A_shift; each value of P^l represents the probability that an edge is retained. M_A^l is a learnable attention template used to obtain the distribution of attention regions. It is usually implicitly assumed that the absolute value of an activation represents the importance of a unit, so we follow this assumption and generate M_A^l by averaging the absolute values of the channels corresponding to each edge.
Under the constraint of formula (6), edges are deleted according to the calculated edge retention probabilities, the probability of an edge being retained is controlled within the interval [1 − 2·d, 1], and the final result after edge deletion is obtained by sampling according to the probability distribution P^l.
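A sketch of this attention-guided variant follows, using the notation of formula (6); deriving M_A^l from the mean absolute channel activations follows the assumption stated above, and the default parameter values are illustrative only.

```python
import torch

def attention_guided_dropedge(edge_activations, d=0.05, k=1.0):
    """edge_activations: (C, N, N) activations associated with each edge.
    Returns a 0/1 mask in which important edges are kept with higher
    probability, per P^l = d * tanh(k * M_A^l) + (1 - d)."""
    m_att = edge_activations.abs().mean(dim=0)      # attention template M_A^l
    p_keep = d * torch.tanh(k * m_att) + (1.0 - d)  # per-edge retention probability
    return torch.bernoulli(p_keep)                  # 1 = keep edge, 0 = drop edge
```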
Ablation experiments are again used below to illustrate the technical effect of introducing the edge deletion mechanism into the model of the present disclosure. The ablation experiments still adopt NTU RGB+D 60 X-view, and comparative experiments are designed.
The Batch-wise deletion strategy randomly drops edges with the given probability once per training batch, while the Layer-wise deletion strategy randomly drops edges with the given probability independently in each layer.
It should be noted that both manners are possible under the random edge deletion policy, but under the adaptive attention-guided edge deletion policy deletion can only be performed layer by layer; "-" indicates that no edge deletion strategy is used.
TABLE 5 Precision comparison of different random edge deletion rates under the random edge deletion strategy
Referring to table 5, as the random edge deletion rate is increased from a baseline of 0, model accuracy improves under both the Batch-wise and Layer-wise strategies; and at the same random edge deletion rate, the accuracy of Layer-wise is better than that of Batch-wise.
TABLE 6 Precision comparison of different adaptive edge deletion parameters under the attention-guided edge deletion strategy
In table 6, model accuracy likewise improves as the adaptive edge deletion parameter is increased, taking the setting of 0 as the baseline.
Thus, referring to tables 5 and 6, Layer-wise clearly outperforms Batch-wise under the random DropEdge strategy, attention-guided DropEdge outperforms the random DropEdge strategy, and both outperform models that do not use DropEdge at all.
Based on this, and considering that embedding the non-local Shift-GCN unit block in the early-fusion framework MIB makes the connection relationships more complex and raises the risk of over-fitting and over-smoothing, an edge deletion mechanism is introduced into the model training process; it augments the data to a certain extent, makes the GCN sparser, and concentrates information transmission, thereby reducing the risk of over-fitting and over-smoothing.
In particular, the adaptive attention-guided edge deletion strategy can also exploit the learnable attention template: edges that contribute more to motion recognition are retained with higher probability, and edges that contribute less are discarded with higher probability, which safeguards the accuracy of the model.
Next, the model identification effect of the solution provided by the present disclosure is compared with other solutions known in the prior art, and the results are filled in table 7 to illustrate the advantages of the solution.
TABLE 7 accuracy comparison of different network models
Referring to table 7, compared with prior-art methods, the MIBSD-GCN provided by the present application achieves higher accuracy (96.6% as shown in table 7) with a very significant computational advantage (2.96G as shown in table 7).
It should be noted that the experimental effects in the present application are only exemplary and do not limit the disclosure.
In one embodiment of the present disclosure, behavior recognition is performed using image data collected from an office building, for example. The data is acquired by a depth camera and sampled at a specified frame rate of 30 FPS, and segments of 150 frames (i.e., 5 seconds) are obtained. The 3D bone key point coordinates of each frame are extracted in each video segment.
Fig. 9 schematically shows bone key points at three consecutive times, where fig. 9(a) shows the bone key points at 0 s, fig. 9(b) shows the bone key points at 5 s, and fig. 9(c) shows the bone key points at 10 s.
After the bone key point sequence is obtained, it is converted into four bone data features, namely bone key point features in a Cartesian coordinate system, bone key point features in a spherical coordinate system, velocity features and skeleton features, and the early-fusion multi-input branch architecture model is used for recognition and analysis to obtain the final action information; here the action type is "falling".
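A sketch of this conversion from the bone key point sequence to the four bone data features follows; the spherical conversion and frame differencing are standard, while the parents array defining the skeleton hierarchy is dataset-specific and appears here only as an assumed argument.

```python
import numpy as np

def four_bone_features(joints, parents):
    """joints: (T, N, 3) Cartesian key points over T frames;
    parents: length-N index array giving each joint's parent."""
    x, y, z = joints[..., 0], joints[..., 1], joints[..., 2]
    r = np.sqrt(x ** 2 + y ** 2 + z ** 2)
    theta = np.arccos(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))  # polar angle
    phi = np.arctan2(y, x)                                          # azimuth
    spherical = np.stack([r, theta, phi], axis=-1)  # spherical-coordinate feature

    velocity = np.zeros_like(joints)
    velocity[1:] = joints[1:] - joints[:-1]         # frame-to-frame motion

    bones = joints - joints[:, parents, :]          # joint minus its parent (skeleton feature)

    return joints, spherical, velocity, bones       # the four input branches
```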
The behavior recognition method mainly addresses the high complexity and insufficient accuracy of the prior art and enhances the practicality of the model, so as to improve the performance-to-cost ratio of automatic behavior recognition systems in automated processing tasks. The main contributions are summarized below:
(1) An early-fusion multi-input branch architecture model is designed, whose inputs are acquired from four independent spatio-temporal feature sequences: bone joint points in a Cartesian coordinate system, bone joint points in a spherical coordinate system, velocity and skeleton features. In particular, the novel bone joint points in a spherical coordinate system are added as an input branch, and their effectiveness is demonstrated.
(2) A displacement map convolution block consisting of a displacement map operation and a lightweight point-wise convolution is embedded into the early-fusion multi-input branch network to reduce the computation of the unit graph convolution module.
(3) An edge deletion rule in displacement graph convolution and an adaptive attention-guided edge deletion strategy are proposed to prevent over-fitting and over-smoothing.
The method can be attached to an intelligent video analysis platform service system and applied to various business scenarios, including but not limited to care-community monitoring, intelligent building monitoring, intelligent visual interaction, and other scenarios requiring high-precision automatic behavior recognition and processing. For example, in an elderly-care/major health monitoring system, risk behaviors in an elderly-care community, such as an elderly person falling or being mistreated, can be accurately predicted in time, so that timely medical aid is obtained and the probability of danger and accidents is reduced. In an intelligent human-computer interaction system, human interaction behaviors can be recognized and specific instructions executed without contact, making human-computer interaction more intelligent. In an insurance dual-recording quality inspection system, specific behaviors regulated by quality inspection can be identified and automatically audited, improving service efficiency and reducing labor cost.
Fig. 10 schematically illustrates a composition diagram of a behavior recognition apparatus in an exemplary embodiment of the present disclosure, and as shown in fig. 10, the behavior recognition apparatus 1000 may include an obtaining module 1001, a branching module 1002, a fusing module 1003, and a recognition module 1004. Wherein:
an obtaining module 1001, configured to obtain multiple bone data features of a target object within a preset time;
a branch module 1002, configured to perform branch processing on the multiple bone data features by using a branch displacement convolution sub-network to obtain multiple feature maps;
the fusion module 1003 is configured to perform mainstream fusion on the multiple feature maps through a fusion displacement convolution sub-network to obtain a feature vector;
and the identifying module 1004 is configured to identify according to the feature vector by using a full connection layer to obtain behavior information of the target object.
According to an exemplary embodiment of the present disclosure, the obtaining module 1001 is configured to acquire image data collected by a camera according to a preset frame rate within a preset time; detect bone key points of the target object in each frame of the image data to obtain the bone key point sequence; and perform feature processing on the bone key point sequence to obtain the multiple items of bone data features; wherein the multiple items of bone data features comprise at least two of bone key point features of a first coordinate system, bone key point features of a second coordinate system, speed features and skeleton features.
According to an exemplary embodiment of the present disclosure, the branch module 1002 is configured to input each bone data feature in parallel into its corresponding pre-trained branch displacement convolution sub-network to obtain the feature map corresponding to each bone data feature; the branch displacement convolution sub-network is constructed by sequentially connecting in series a batch normalization layer, a displacement map convolution initial block and at least one displacement map convolution block.
According to an exemplary embodiment of the present disclosure, the fusion module 1003 is configured to uniformly input the multiple feature maps into the pre-trained fusion displacement convolution sub-network to obtain the feature vector; the fusion displacement convolution sub-network is constructed by sequentially connecting at least one displacement graph convolution block in series.
According to an exemplary embodiment of the present disclosure, the displacement map convolution block includes a spatial displacement operation module, a spatial point-by-point convolution module, a temporal displacement operation module, and a temporal point-by-point convolution module.
According to an exemplary embodiment of the present disclosure, the behavior recognition apparatus 1000 may further include a training module, configured to obtain a skeletal data training set and behavior labels of the skeletal data training set; constructing an original input graph according to the bone data training set, and deleting the original input graph to obtain a target input graph; identifying by using the branch displacement convolution sub-network, the fusion displacement convolution sub-network and the full connection layer based on the target input graph to obtain identification behavior information; and comparing the identification behavior information with the behavior label to modify the model parameters of the branch displacement convolution sub-network, the fusion displacement convolution sub-network and the full connection layer.
According to an exemplary embodiment of the present disclosure, the training module further includes an edge deletion unit, where the edge deletion unit is configured to construct an original adjacency matrix according to a displacement relationship in the original input graph; calculating a target adjacency matrix corresponding to the target input graph based on the original adjacency matrix and a preset random edge retention rate; or constructing an attention template adjacency matrix according to the displacement relation in the original input graph; calculating the edge retention probability of each edge of the original input graph based on the attention template adjacency matrix and a preset self-adaptive edge deletion parameter; and acquiring a target adjacency matrix corresponding to the target input graph according to the edge retention probability.
The specific details of each module in the behavior recognition apparatus 1000 are already described in detail in the corresponding behavior recognition method, and therefore are not described herein again.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
In an exemplary embodiment of the present disclosure, there is also provided a storage medium capable of implementing the above-described method. Fig. 11 schematically illustrates a computer-readable storage medium in an exemplary embodiment of the disclosure. As shown in fig. 11, a program product 1100 for implementing the above method according to an embodiment of the disclosure may employ a portable compact disc read-only memory (CD-ROM) including program code, and may be run on a terminal device such as a mobile phone. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided. Fig. 12 schematically shows a structural diagram of a computer system of an electronic device in an exemplary embodiment of the disclosure.
It should be noted that the computer system 1200 of the electronic device shown in fig. 12 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 12, the computer system 1200 includes a Central Processing Unit (CPU)1201, which can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data necessary for system operation are also stored. The CPU 1201, ROM 1202, and RAM 1203 are connected to each other by a bus 1204. An Input/Output (I/O) interface 1205 is also connected to bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a Display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the internet. A driver 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1210 as necessary, so that a computer program read out therefrom is mounted into the storage section 1208 as necessary.
In particular, the processes described below with reference to the flowcharts may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209, and/or installed from the removable medium 1211. The computer program, when executed by a Central Processing Unit (CPU)1201, performs various functions defined in the system of the present disclosure.
It should be noted that the computer readable medium shown in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present disclosure also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (10)
1. A method of behavior recognition, comprising:
acquiring a plurality of skeletal data characteristics of a target object;
performing branch processing on the multiple bone data characteristics by using a branch displacement convolution sub-network to obtain multiple characteristic mappings;
performing mainstream fusion on the multiple feature mappings through a fusion displacement convolution sub-network to obtain a feature vector;
and identifying by utilizing a full connection layer according to the characteristic vector to obtain the behavior information of the target object.
2. The behavior recognition method according to claim 1, wherein the obtaining of the plurality of skeletal data features of the target object comprises:
acquiring image data within a preset time collected by a camera according to a preset frame rate;
detecting bone key points of the target object in each frame of the image data to obtain the bone key point sequence;
performing feature processing on the bone key point sequence to obtain the multiple items of bone data features; wherein the multiple items of bone data features comprise at least two of bone key point features of a first coordinate system, bone key point features of a second coordinate system, speed features and skeleton features.
3. The behavior recognition method of claim 1, wherein the branch processing of the plurality of bone data features using the branch-shift convolution sub-network to obtain a plurality of feature maps comprises:
inputting all the bone data characteristics into the branch displacement convolution sub-network which is corresponding to the bone data characteristics and is trained in advance in parallel to obtain characteristic mappings corresponding to all the bone data characteristics respectively; the branch displacement convolution sub-network is constructed by sequentially connecting a batch normalization layer, a displacement map convolution initial block and at least one displacement map convolution block in series.
4. The behavior recognition method according to claim 1, wherein the mainstream fusion of the plurality of feature maps by the fusion displacement convolution sub-network to obtain a feature vector comprises:
uniformly inputting the multiple feature mappings into the pre-trained fusion displacement convolution sub-network to obtain the feature vectors; the fusion displacement convolution sub-network is constructed by sequentially connecting at least one displacement graph convolution block in series.
5. The behavior recognition method according to claim 3 or 4, wherein the displacement map convolution block includes a spatial displacement operation module, a spatial point-by-point convolution module, a temporal displacement operation module, and a temporal point-by-point convolution module.
6. The behavior recognition method of claim 1, further comprising model training the branch shift convolution sub-network, the fusion shift convolution sub-network, and the fully-connected layer, the model training comprising:
acquiring a bone data training set and a behavior label of the bone data training set;
constructing an original input graph according to the bone data training set, and performing edge deletion on the original input graph to obtain a target input graph;
identifying by using the branch displacement convolution sub-network, the fusion displacement convolution sub-network and the full connection layer based on the target input graph to obtain identification behavior information;
and comparing the identification behavior information with the behavior label to modify the model parameters of the branch displacement convolution sub-network, the fusion displacement convolution sub-network and the full connection layer.
7. The behavior recognition method according to claim 6, wherein the performing edge deletion on the original input graph to obtain a target input graph comprises:
constructing an original adjacency matrix according to the displacement relation in the original input graph;
calculating a target adjacency matrix corresponding to the target input graph based on the original adjacency matrix and a preset random edge retention rate; or
Constructing an attention template adjacency matrix according to the displacement relation in the original input graph;
calculating the edge retention probability of each edge of the original input graph based on the attention template adjacency matrix and a preset self-adaptive edge deletion parameter;
and acquiring a target adjacency matrix corresponding to the target input graph according to the edge retention probability.
8. A behavior recognition apparatus, comprising:
the acquisition module is used for acquiring a plurality of skeletal data characteristics of the target object within preset time;
the branch module is used for performing branch processing on the multiple bone data characteristics by using a branch displacement convolution sub-network to obtain multiple characteristic mappings;
the fusion module is used for carrying out mainstream fusion on the multiple feature mappings through a fusion displacement convolution sub-network to obtain a feature vector;
and the identification module is used for identifying according to the characteristic vector by utilizing the full connection layer to obtain the behavior information of the target object.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method of behavior recognition according to any one of claims 1 to 7.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the behavior recognition method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111229367.7A CN113887501B (en) | 2021-10-21 | Behavior recognition method, device, storage medium and electronic device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113887501A true CN113887501A (en) | 2022-01-04 |
CN113887501B CN113887501B (en) | 2025-02-21 |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190130562A1 (en) * | 2017-11-02 | 2019-05-02 | Siemens Healthcare Gmbh | 3D Anisotropic Hybrid Network: Transferring Convolutional Features from 2D Images to 3D Anisotropic Volumes |
CN110135277A (en) * | 2019-07-05 | 2019-08-16 | 南京邮电大学 | A Human Action Recognition Method Based on Convolutional Neural Network |
US10482334B1 (en) * | 2018-09-17 | 2019-11-19 | Honda Motor Co., Ltd. | Driver behavior recognition |
CN110674869A (en) * | 2019-09-23 | 2020-01-10 | 腾讯科技(深圳)有限公司 | Classification processing and graph convolution neural network model training method and device |
CN111160164A (en) * | 2019-12-18 | 2020-05-15 | 上海交通大学 | Action recognition method based on human body skeleton and image fusion |
CN111476181A (en) * | 2020-04-13 | 2020-07-31 | 河北工业大学 | A method for recognition of human skeleton movements |
CN112860897A (en) * | 2021-03-12 | 2021-05-28 | 广西师范大学 | Text classification method based on improved ClusterGCN |
CN113065577A (en) * | 2021-03-09 | 2021-07-02 | 北京工业大学 | A Goal-Oriented Multimodal Sentiment Classification Method |
CN113378656A (en) * | 2021-05-24 | 2021-09-10 | 南京信息工程大学 | Action identification method and device based on self-adaptive graph convolution neural network |
Non-Patent Citations (2)
Title |
---|
KE CHENG: "Skeleton-Based Action Recognition with Shift Graph Convolutional Network", 《2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》, 5 August 2020 (2020-08-05), pages 180 - 189 * |
YI-FAN SONG: "Stronger, Faster and More Explainable: A Graph Convolutional Baseline for Skeleton-based Action Recognition", 《COMPUTER VISION AND PATTERN RECOGNITION》, 20 October 2020 (2020-10-20), pages 1 - 9 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114511931A (en) * | 2022-02-22 | 2022-05-17 | 平安科技(深圳)有限公司 | Action recognition method, device and equipment based on video image and storage medium |
CN115050101A (en) * | 2022-07-18 | 2022-09-13 | 四川大学 | Gait recognition method based on skeleton and contour feature fusion |
CN115050101B (en) * | 2022-07-18 | 2024-03-22 | 四川大学 | A gait recognition method based on fusion of skeleton and contour features |
WO2024116608A1 (en) * | 2022-11-30 | 2024-06-06 | 株式会社日立製作所 | Computer system and information processing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |