WO2022073282A1 - Motion recognition method based on feature interactive learning, and terminal device
Motion recognition method based on feature interactive learning, and terminal device
- Publication number
- WO2022073282A1 (application no. PCT/CN2020/129550)
- Authority
- WO
- WIPO (PCT)
Classifications
- G06V40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
- G06F18/214: Pattern recognition; analysing; design or setup of recognition systems or techniques; extraction of features in feature space; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/24: Pattern recognition; analysing; classification techniques
- G06N3/045: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06V20/40: Scenes; scene-specific elements in video content
- H04N19/85: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
Definitions
- the present application belongs to the technical field of computer vision, and in particular, relates to an action recognition method and terminal device based on feature interaction learning.
- human action recognition has become one of the research hotspots in the field of computer vision.
- computers can automatically understand and describe human actions in videos, which has great application value in many fields, such as video surveillance, human-computer interaction, motion analysis, content-based video retrieval, and autonomous driving.
- the methods of human action recognition mainly include methods based on artificially designed features and methods based on neural network deep learning features.
- the methods based on neural network deep learning features have achieved certain success in the recognition of human actions.
- in current human action recognition methods based on neural network deep learning, when processing the action classification and recognition of long video sequences, a certain number of video frames are obtained through sparse sampling as the input of the neural network, and features are extracted from the video frames layer by layer through the neural network to identify and classify human actions;
- due to the complexity and variability of video shooting angles, shooting dimensions, and shooting backgrounds, and the differences and similarities among actions, this single-modality sparse sampling approach yields a low accuracy rate of action recognition.
- the embodiments of the present application provide an action recognition method and terminal device based on feature interaction learning, which can solve the problem of low accuracy of action recognition due to the sparse sampling method of a single modality.
- an embodiment of the present application provides an action recognition method based on feature interaction learning, the method including: acquiring video data of an action to be recognized, the video data including a first video sequence and a second video sequence; performing compression processing on the first video sequence and the second video sequence respectively to obtain a first motion map corresponding to the first video sequence and a second motion map corresponding to the second video sequence;
- inputting the first motion map and the second motion map into the trained dual-stream neural network model, and performing interactive learning on the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model, to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence output by the trained dual-stream neural network model; and determining the classification result of the action to be recognized based on the first prediction result and the second prediction result.
- performing compression processing on the first video sequence to obtain the first motion map corresponding to the first video sequence includes:
- performing compression processing on the second video sequence to obtain the second motion map corresponding to the second video sequence includes:
- the trained dual-stream neural network model includes a first neural network model, a second neural network model, and a routing module, and the routing module is disposed between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model; the input of the first neural network model is the first motion map, and the output is the first prediction result of the first video sequence; the input of the second neural network model is the second motion map, and the output is the second prediction result of the second video sequence; the routing module is used for performing interactive learning on the features of the first motion map and the features of the second motion map between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model.
- the intermediate convolution module of the first neural network model includes a first convolution module with a preset number of layers;
- the intermediate convolution module of the second neural network model includes a second convolution module corresponding to the first convolution module;
- the first motion map and the second motion map are input into the trained dual-stream neural network model, and the trained dual-stream neural network model performs interactive learning on the features of the first motion map and the features of the second motion map to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence, including:
- the output of the first convolution module of the first layer and the output of the second convolution module of the first layer are used as the input of the routing module of the first layer, and feature interactive learning is performed by the routing module of the first layer to obtain a first route output;
- the superposition result of the output of the first convolution module of the first layer and the first route output is used as the input of the first convolution module of the second layer, and the first convolution module of the second layer performs feature learning to obtain the output of the first convolution module of the second layer;
- the superposition result of the output of the second convolution module of the first layer and the first route output is used as the input of the second convolution module of the second layer, and the second convolution module of the second layer performs feature learning to obtain the output of the second convolution module of the second layer;
- the first convolution module of the first layer and the first convolution module of the second layer are two adjacent convolution layers in the intermediate convolution module of the first neural network model;
- the second convolution module of the first layer and the second convolution module of the second layer are two adjacent convolution layers in the intermediate convolution module of the second neural network model;
- the routing module of the first layer and the routing module of the second layer are two adjacent computing modules.
- the routing module includes: a first convolution unit, a first normalization unit, a first activation unit, a second convolution unit, a second normalization unit, and a second activation unit; through the first convolution unit, the first normalization unit, the first activation unit, the second convolution unit, the second normalization unit, and the second activation unit of the routing module in turn, interactive learning is performed on the feature matrix output by the convolution calculation module of the first neural network model and the feature matrix output by the convolution calculation module of the second neural network model, to obtain the feature matrix output by the routing module.
- the determining a classification result of the to-be-recognized action based on the first prediction result and the second prediction result includes:
- Feature fusion is performed on the first prediction result and the second prediction result to obtain a probability distribution of action categories; the action category with the highest probability in the probability distribution is used as the classification result of the action to be recognized.
- the first neural network model includes a first loss function
- the second neural network model includes a second loss function
- the first neural network model, the second neural network model, and the routing module are trained, and the parameters of the first neural network model, the parameters of the second neural network model, and the parameters of the routing module are adjusted according to the first loss function and the second loss function respectively; if the first loss function and the second loss function meet a preset threshold, the training of the parameters of the first neural network model, the second neural network model, and the routing module is stopped, to obtain the trained dual-stream neural network model.
- an embodiment of the present application provides an action recognition device based on feature interactive learning, including:
- an acquisition unit configured to acquire video data of an action to be recognized, the video data including a first video sequence and a second video sequence;
- a processing unit, configured to compress the first video sequence and the second video sequence respectively, to obtain a first motion map corresponding to the first video sequence and a second motion map corresponding to the second video sequence;
- a computing unit, configured to input the first motion map and the second motion map into a trained dual-stream neural network model, and to perform interactive learning on the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model, to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence;
- An output unit configured to determine a classification result of the action to be recognized based on the first prediction result and the second prediction result.
- an embodiment of the present application provides a terminal device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the method described in the first aspect or any possible implementation manner of the first aspect is implemented when the processor executes the computer program.
- embodiments of the present application provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the method described in the first aspect or any possible implementation manner of the first aspect is implemented.
- an embodiment of the present application provides a computer program product that, when the computer program product runs on a terminal device, enables the terminal device to execute the action recognition method described in any one of the first aspects above.
- the terminal device can obtain video data of an action to be recognized, the video data including a first video sequence and a second video sequence;
- the first video sequence and the second video sequence are respectively compressed to obtain a first motion map corresponding to the first video sequence and a second motion map corresponding to the second video sequence;
- the first motion map and the second motion map are input into the trained dual-stream neural network model;
- through the trained dual-stream neural network model, the features of the first motion map and the features of the second motion map are interactively learned to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence; based on the first prediction result and the second prediction result, the classification result of the action to be recognized is determined;
- compressing the first video sequence and the second video sequence respectively into the first motion map and the second motion map enriches the spatial and temporal representation of the video data, making the information representation more complete and the features richer; using the first motion map and the second motion map as the input of the dual-stream neural network model and interactively learning the multi-modal image features through the neural network model improves the accuracy of action recognition, and the method has strong ease of use and practicality.
- FIG. 1 is a schematic flowchart of an action recognition method provided by an embodiment of the present application.
- FIG. 2 is a schematic diagram of video data compression processing provided by an embodiment of the present application.
- FIG. 3 is a schematic diagram of a network architecture of a dual-stream neural network model provided by an embodiment of the present application
- FIG. 4 is a schematic diagram of the architecture of a routing module of a dual-stream neural network provided by an embodiment of the present application
- FIG. 5 is a schematic diagram of the architecture of a middle-level feature interaction learning unit of a dual-stream neural network provided by an embodiment of the present application;
- FIG. 6 is a schematic structural diagram of a motion recognition device provided by an embodiment of the present application.
- FIG. 7 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
- the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting “.
- the phrases "if it is determined" or "if the [described condition or event] is detected" may be interpreted, depending on the context, to mean "once it is determined", "in response to the determination", "once the [described condition or event] is detected", or "in response to detection of the [described condition or event]".
- references in this specification to "one embodiment” or “some embodiments” and the like mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
- appearances of the phrases "in one embodiment", "in some embodiments", "in other embodiments", and the like in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise.
- the terms "including", "comprising", "having" and their variants mean "including but not limited to" unless specifically emphasized otherwise.
- at present, human action recognition mainly uses models based on convolutional neural networks, such as 2D convolutional networks (2D-ConvNets) and 3D convolutional networks (3D-ConvNets), as well as recurrent neural networks (RNN).
- in human action recognition methods based on convolutional neural networks, given a certain number of RGB or depth video sequences, a certain number of video frames are obtained as the input of the network through sparse sampling, the convolutional neural network extracts the features in the video frames layer by layer, and the human actions are classified and recognized through a classifier or a normalization (Softmax) layer.
- human action recognition methods based on convolutional neural networks can be divided into the following two categories. First, 2D end-to-end network training structures: a deep network is trained with supervision on a large-scale labeled dataset, and then a trained model for the actual task is obtained through parameter fine-tuning. For video sequences, this kind of method mainly uses sparse sampling to obtain a certain number of frames of the entire video sequence as the network input, and cannot learn the temporal-dimension features of human actions well. Second, 3D end-to-end network training structures: a few frames of images are obtained through sparse sampling as the input of the network model, and the classification model is obtained through supervised training and parameter fine-tuning. This kind of method can obtain a better recognition effect, but the huge amount of calculation restricts its application in practical scenarios.
- an RNN memorizes previous information and applies it to the calculation of the current output; it can process sequence data of any length and realizes feature learning, recognition, and classification through ordered recurrent learning of the input sequence; it has achieved good results in the field of natural language processing, but its performance on human action recognition still needs to be further improved.
- action recognition methods based on convolutional neural networks or other deep networks lack the representation of multi-modal spatiotemporal information of long video sequences and the mutual learning of multi-modal features, so the accuracy of action recognition still needs to be improved.
- This application will realize the processing and recognition of user action classification based on the representation of multimodal spatiotemporal information of long video sequences and the mutual learning of multimodal features, which further improves the accuracy of user action recognition.
- the technical solutions of the present application will be described in detail below with reference to the drawings and specific embodiments.
- FIG. 1 is a schematic flowchart of a motion recognition method provided by an embodiment of the present application.
- the execution body of the method may be an independent terminal, such as a mobile phone, a computer, a multimedia device, a streaming media device, a monitoring device, or another terminal device; it can also be an integrated module in the terminal device, implemented as a certain function of the terminal device.
- the following describes an example of the method being executed by a terminal device, but the embodiment of the present application is not limited to this. As shown in Figure 1, the method includes:
- Step S101: acquiring video data of the action to be recognized, where the video data includes a first video sequence and a second video sequence.
- the video data is a multi-frame image sequence ordered in time, that is, the sequence of all image frames of the entire video in which the action is to be recognized.
- the terminal device can acquire the video data of the action to be recognized in real time through the RGB-D camera device; the video data of the action to be recognized can also be the video data pre-stored in the terminal device.
- the first video sequence and the second video sequence are video frame sequences of two different modalities, that is, different feature representations of the same piece of video data, for example, a color video sequence in RGB format, a video sequence represented by depth information, a video sequence in optical-flow-map format, or a skeleton sequence.
- the first video sequence is a color video sequence in RGB format and the second video sequence is a video sequence represented by depth information.
- a color video sequence is a multi-frame image sequence in RGB format, that is, a color image sequence in which the pixel information of each frame is represented by the three colors red, green, and blue (RGB); a video sequence represented by depth information is a depth image sequence in which the pixel information of each frame is represented by depth values;
- the image depth of each frame determines the possible number of colors or the possible number of gray levels of each pixel of the image.
- the acquired sequence frames of the first video sequence and the second video sequence can be in one-to-one correspondence in time sequence, that is, each video frame of the first video sequence corresponds to a video frame of the second video sequence at the same moment;
- the first video sequence and the second video sequence of the same video can include the same number of video frames; thus, the spatiotemporal information of the video frames at the same moment can be represented by different feature quantities.
- the spatiotemporal information includes the information of the temporal dimension and the information of the spatial dimension of the video frame sequence; the information of the temporal dimension is represented by the different time points corresponding to each video frame, and a continuous sequence of video frames constitutes a dynamic effect through the continuity of time; the information of the spatial dimension can be expressed as the texture information or color information of each video frame.
- a video frame of a color video sequence in RGB format is represented by a matrix of 3 channels * width W * height H, and the elements in the three channels represent the color information of each pixel in the video frame; in a video sequence represented by depth information, the depth information is measured in length units (such as millimeters);
- the distance information representing the depth is converted correspondingly into grayscale information, obtaining a grayscale image represented by a matrix of 1 channel * width W * height H.
- sequence frames of the first video sequence and the sequence frames of the second video sequence correspond one-to-one in time sequence, and are the same piece of video data.
- for example, if the first video sequence includes 50 frames of images, the second video sequence also includes 50 frames of images.
- if the first video sequence is color video data in RGB format, it can be acquired by a camera in RGB format; if the second video sequence is a video sequence represented by depth information, it can be acquired by a depth camera; the shooting parameters set for the two cameras may be the same, the same target is shot in the same time period, and there is no specific limitation here.
- the terminal device can be a device integrated with a camera, and the terminal device can directly obtain the video data of the action to be recognized through the camera device; the terminal device can also be a device separate from the camera device, and the terminal device communicates with the camera device through a wired or wireless connection to obtain the video data of the action to be recognized.
- the action to be recognized may be a human action or activity, or an animal action or activity, and is not specifically limited here.
- the terminal device obtains the image frame sequence of the entire video data of the action to be recognized and records the multi-modal underlying features of the video data, making good use of the spatiotemporal information features of the different modalities of the first video sequence and the second video sequence; this provides a basis for the subsequent neural network model to learn various possibilities of the features and enhances the neural network model's ability to express and recognize image features.
- Step S102: compressing the first video sequence and the second video sequence respectively to obtain a first motion map corresponding to the first video sequence and a second motion map corresponding to the second video sequence.
- the terminal device compresses the multi-frame images of the first video sequence and the multi-frame images of the second video sequence respectively, so as to obtain rich spatial and temporal data.
- the feature representation of the first motion map is different from the feature representation of the second motion map; they are representations of different underlying features of the same video, that is, the image features of the video frames in the first video sequence and the second video sequence are represented by different image information.
- the spatiotemporal information of the first motion map includes the spatiotemporal information of all video frames of the first video sequence;
- the spatiotemporal information of the second motion map includes the spatiotemporal information of all video frames of the second video sequence; for example, for an RGB video sequence and a depth video sequence, the temporal-dimension information and spatial-dimension information of the images are compressed and expressed as a single three-channel image and a single single-channel image, respectively, showing dynamic effects and information such as color and texture.
- each video frame of the first video sequence corresponds to a feature matrix
- each video frame of the second video sequence corresponds to a feature matrix
- the first video sequence or the second video sequence may respectively include T frames of images;
- if the feature matrix corresponding to each frame of image is I_t, then the feature matrix set of the first video sequence or of the second video sequence can be expressed as <I_1, I_2, I_3, ..., I_T>, where I_1 is the feature matrix of the first frame image in the video sequence arranged in time series, and so on, and I_T is the feature matrix of the T-th frame image in the video sequence arranged in time series.
- the first video sequence and the second video sequence are respectively subjected to compression processing, and the multiple frames of images of each video sequence are compressed and synthesized into one image that contains the feature information representing the action across time and space, which can be called a motion map; a paired first motion map and second motion map containing the spatiotemporal information of the entire video sequence are thereby obtained; that is, the feature matrices of multiple frames of images are combined and represented in one image, so that the spatiotemporal information of all video frames in the video sequence can be obtained.
- the first motion map may be an image synthesized by compressing the frames of a video sequence in RGB format;
- the second motion map may be an image synthesized by compressing a video sequence represented by depth information;
- the first motion map and the second motion map may also be compressed and synthesized images of video sequences of other modalities.
- the first video sequence is compressed to obtain a first motion map corresponding to the first video sequence, including:
- the first video sequence includes multiple frames of images, and each frame of image corresponds to a feature matrix; if the first video sequence is color video data in RGB format, the feature matrix of each frame of the first video sequence is a matrix of 3 channels * width W * height H, where width W and height H are in pixels, and the elements in the feature matrix correspond to pixels.
- the value of each element in the feature matrix represents the feature of the pixel at the corresponding position, such as a color image in RGB format, and each element represents the feature value of each pixel in the three channels of red R, green G, and blue B respectively.
- each image frame of the first video sequence corresponds to a feature matrix
- the elements at the same position in the feature matrices of all video frames are added and then divided by the total number of video frames of the first video sequence to obtain the element value at each position in the feature matrix; each element value is rounded down (for example, 2.6 is rounded down to 2), and the feature matrix of the first motion map corresponding to the first video sequence is obtained.
- as shown in FIG. 2, a schematic diagram of video data compression processing provided by an embodiment of the present application, when a video sequence is color video data in RGB format, the RGB video sequence is compressed to obtain a corresponding RGB motion map, and the spatiotemporal information of the multiple frames of images is synthesized into the spatiotemporal information of one motion map.
- the feature matrix of the motion map corresponding to the RGB video sequence may be a 3*W*H matrix, and it can be calculated by the following formula: MI = floor( (I_1 + I_2 + ... + I_T) / T ), where the rounding down is applied element-wise;
- MI is the feature matrix of the motion map corresponding to the first video sequence;
- T is the total number of frames of the first video sequence;
- I_τ is the feature matrix of the τ-th frame image in the first video sequence;
- the value range of τ is an integer in [1, T].
- the value range of an element in the feature matrix of each frame of image of the first video sequence may be an integer in [0, 255], and the value range of each element in the feature matrix MI of the motion map obtained after the compression processing of the first video sequence is also an integer in [0, 255].
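- the following NumPy sketch illustrates this compression step; it is an illustrative example rather than part of the original disclosure, and the (T, C, H, W) array layout and the uint8 value range are assumptions chosen to match the description above.

```python
import numpy as np

def compress_to_motion_map(frames: np.ndarray) -> np.ndarray:
    """Compress a video sequence into a single motion map.

    frames: array of shape (T, C, H, W) with integer values in [0, 255],
    e.g. C=3 for an RGB sequence (layout assumed for illustration).
    Returns an array of shape (C, H, W): the element-wise mean over the
    T frames, rounded down, so values stay in [0, 255].
    """
    # Add elements at the same position over all frames, divide by the
    # total number of frames T, then round each element down.
    mean = frames.astype(np.float64).sum(axis=0) / frames.shape[0]
    return np.floor(mean).astype(np.uint8)

# Example: 50 RGB frames of size 224x224 compressed into one 3-channel motion map MI.
rgb_frames = np.random.randint(0, 256, size=(50, 3, 224, 224), dtype=np.uint8)
mi = compress_to_motion_map(rgb_frames)  # shape (3, 224, 224)
```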
- performing compression processing on the second video sequence to obtain the second motion map corresponding to the second video sequence includes:
- the second video sequence includes multiple frames of images, and each frame of image corresponds to a feature matrix; if the second video sequence is an image sequence in which each video frame is represented by depth information, the feature matrix of each frame image is a matrix of 1 channel * width W * height H, where width W and height H are in pixels, and the elements in the feature matrix correspond to pixels; the value of each element in the feature matrix represents the feature of the pixel at the corresponding position.
- the depth map of each frame in the second video sequence can be gray-scaled, and the depth information of each pixel in the depth map can be converted by mapping to [0, 255] to obtain the grayscale image of the video frame; the value of each element in the feature matrix of the grayscale image is an integer in the range [0, 255].
- the values of a video sequence represented by depth information may range from 0 to 10000 mm, while the representation range of images in computer vision is [0, 255], so the video sequence represented by depth information needs to be scaled to match the visual representation range, that is, the video sequence represented by depth information is mapped and converted to grayscale images.
- the compression process of the second video sequence is similar to that of the first video sequence.
- a feature matrix of the grayscale image is obtained for each frame; the elements at the same position in the feature matrices of the grayscale images corresponding to all video frames in the second video sequence are added and divided by the total number of video frames of the second video sequence to obtain the element value at each position in the feature matrix, and each element value is rounded down to obtain the feature matrix of the motion map corresponding to the second video sequence.
- as shown in FIG. 2, the schematic diagram of video data compression processing provided by an embodiment of the present application, when a video sequence is a video sequence represented by depth information, the depth video sequence is subjected to grayscale processing to obtain the grayscale images corresponding to the depth video sequence, the grayscale images are compressed to obtain the corresponding depth motion map, and the spatiotemporal information of the multiple frames of images is synthesized into the spatiotemporal information of one motion map.
- the feature matrix of the motion map corresponding to the depth video sequence may be a 1*W*H matrix, and it can be calculated by the following formula: MJ = floor( (I_1 + I_2 + ... + I_N) / N ), where the rounding down is applied element-wise;
- MJ is the feature matrix of the motion map corresponding to the second video sequence;
- N is the total number of frames of the second video sequence;
- I_n is the feature matrix of the n-th frame image in the second video sequence;
- the value range of n is an integer in [1, N].
- N and T may be equal, and the values of n and ⁇ may be equal, that is, the video frames of the first video sequence and the video frames of the second video sequence correspond one-to-one in time sequence.
- the value range of an element in the feature matrix of each frame of grayscale image corresponding to the second video sequence may be an integer in [0, 255], and the value range of each element in the feature matrix MJ of the motion map corresponding to the second video sequence can also be an integer in [0, 255].
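- a minimal sketch of the depth-to-grayscale mapping followed by the same compression rule is shown below; the 10000 mm maximum range and the linear scaling follow the description above, while the array shapes and the reuse of the averaging function are assumptions made for illustration.

```python
import numpy as np

def depth_to_grayscale(depth_frames: np.ndarray, max_depth_mm: float = 10000.0) -> np.ndarray:
    """Linearly map depth values in millimetres to integer grayscale values in [0, 255]."""
    scaled = np.clip(depth_frames, 0, max_depth_mm) / max_depth_mm * 255.0
    return scaled.astype(np.uint8)

def compress_to_motion_map(frames: np.ndarray) -> np.ndarray:
    """Element-wise mean over all frames, rounded down (same rule as the RGB motion map)."""
    return np.floor(frames.astype(np.float64).sum(axis=0) / frames.shape[0]).astype(np.uint8)

# Example: 50 single-channel depth frames (224x224) with values in millimetres.
depth_frames = np.random.randint(0, 10001, size=(50, 1, 224, 224)).astype(np.float32)
mj = compress_to_motion_map(depth_to_grayscale(depth_frames))  # shape (1, 224, 224)
```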
- the video frames in the first video sequence in the RGB format and the video frames in the second video sequence represented by depth information may be in one-to-one correspondence.
- the grayscale image sequence obtained by performing grayscale processing on the video frames in the second video sequence represented by the depth information is also in one-to-one correspondence with the video frames in the first video sequence in the RGB format.
- Step S103: inputting the first motion map and the second motion map into the trained dual-stream neural network model, and performing interactive learning on the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence.
- the dual-stream neural network model is an overall model including two independent convolutional neural network models and a routing module.
- the dual-stream neural network model includes two inputs and two outputs. Among them, the two inputs correspond to the feature information of the two modalities of the video data respectively, and the two outputs correspond to the prediction results of the input information of the two modalities respectively.
- the dual-stream neural network model includes two independent convolutional neural network models and routing modules, and the inputs of the two convolutional neural network models are respectively the first motion map and the second motion map;
- the convolutional neural network model of each channel includes multiple convolution layers, such as the convolution module Conv1, the convolution modules Conv2_x to Conv5_x, and the fully connected layer, wherein the convolution modules Conv2_x to Conv5_x each represent a total convolution module, and a total convolution module may include a number of convolution layers or convolution calculation units.
- the output of the previous module is interactively learned through the routing module, and the output of the routing module is superimposed with the output of the previous convolution module as the input of the next convolution module.
- the mid-level interaction features of different modalities in the dual-stream neural network model are learned through the routing module.
- the basic network of the two-way convolutional neural network model can be a residual network (ResNet); due to the high modularity of the residual network, each module in the residual network can be used as a basic module, and model training and interactive feature learning are performed on the feature information of the different modalities of the first motion map and the second motion map.
- the dual-stream neural network model optimizes and trains the model through dual loss functions.
- the basic network model of the dual-stream neural network model can be a deep network model such as Inception, ImageNet, TSN, or a two-stream network; the parameters of the basic network model are trained and adjusted by fine-tuning; a network model can also be designed as needed and its parameters adjusted on a training set.
- joint optimization training is performed through the dual loss functions to obtain dual-stream high-level features of the modalities corresponding to the input image features of the different modalities; for example, if the input modalities are motion maps in RGB format and motion maps represented by depth information, dual-stream high-level features of the two modalities, RGB format and depth information, can be obtained.
- the two inputs may each include multiple channel inputs; for example, if one of the inputs is an RGB motion map, that input may include three channel inputs, corresponding respectively to the feature matrix of the red (R) channel, the feature matrix of the green (G) channel, and the feature matrix of the blue (B) channel of the input RGB motion map.
- the trained dual-stream neural network model includes a first neural network model, a second neural network model, and a routing module, and the routing module is disposed between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model; the input of the first neural network model is the first motion map, and the output is the first prediction result of the first video sequence; the input of the second neural network model is the second motion map, and the output is the second prediction result of the second video sequence; the routing module is used for performing interactive learning, between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model, on the output features of each layer of convolution modules of the dual-stream neural network model.
- in the network architecture of the dual-stream neural network model, the first neural network model corresponds to one channel of input and output, and the second neural network model corresponds to the other channel of input and output.
- the first motion map input into the first neural network model can be an RGB motion map;
- the first prediction result output by the first neural network model is the recognition result corresponding to the first video sequence;
- the first video sequence can be a color video sequence in RGB format;
- the RGB motion map is obtained by compressing the color video sequence in RGB format.
- the second motion map input into the second neural network model can be a depth motion map;
- the second prediction result output by the second neural network model is the recognition result corresponding to the second video sequence;
- the second video sequence can be a depth video sequence represented by depth information; the depth motion map is obtained by compressing the depth video sequence represented by depth information.
- the middle layer of the dual-stream neural network includes multiple convolution modules and multiple routing modules, such as the convolution module Conv1 and the convolution modules Conv2_x to Conv5_x shown in FIG. 4; a routing module is set after each convolution module of the two-way convolutional neural network model, the output of the previous module is interactively learned through the routing module, and the output of the routing module is superimposed with the output of the previous convolution module as the input of the next convolution module; in this way, the routing modules learn the mid-level interaction features of the different modalities in the dual-stream neural network model.
- the intermediate convolution module of the first neural network model includes a first convolution module with a preset number of layers;
- the intermediate convolution module of the second neural network model includes a second convolution module corresponding to the first convolution module;
- the first motion map and the second motion map are input into the trained dual-stream neural network model, and the features of the first motion map and the features of the second motion map are interactively learned through the trained dual-stream neural network model to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence, including:
- the first convolution module of the first layer and the first convolution module of the second layer are two adjacent convolution layers in the intermediate convolution module of the first neural network model;
- the second convolution module of the first layer and the second convolution module of the second layer are two adjacent convolution layers in the intermediate convolution module of the second neural network model;
- the routing module of the first layer and the routing module of the second layer are two adjacent computing modules.
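- the following PyTorch-style sketch (an illustrative assumption, not code from the disclosure) shows how one such interaction step can be wired: the outputs of the two parallel convolution modules of one layer feed the routing module of that layer, and the route output is superimposed on each stream's output before entering the convolution modules of the next layer.

```python
import torch
import torch.nn as nn

class InteractionStep(nn.Module):
    """One mid-layer interaction step of a dual-stream network (illustrative sketch).

    conv_a and conv_b stand for the next-layer convolution modules of the two
    streams; route stands for the routing module of the current layer, assumed
    to take both feature maps and return a tensor of matching shape.
    """
    def __init__(self, conv_a: nn.Module, conv_b: nn.Module, route: nn.Module):
        super().__init__()
        self.conv_a, self.conv_b, self.route = conv_a, conv_b, route

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # Feature interactive learning on the two current-layer outputs.
        r = self.route(feat_a, feat_b)
        # Superimpose the route output on each stream's output and feed the next layer.
        next_a = self.conv_a(feat_a + r)
        next_b = self.conv_b(feat_b + r)
        return next_a, next_b
```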
- a convolution module includes a number of convolution layers or convolution calculation units; a convolution layer can be regarded as a set of parallel feature maps, formed by sliding different convolution kernels over the input image and performing certain operations; at each sliding position, an element-wise product-and-sum operation is performed between the convolution kernel and the input image, projecting the information in the receptive field onto an element in the feature map;
- the size of the convolution kernel is smaller than the size of the input image, and it overlaps or acts on the input image in parallel; every element in the feature map output by each layer of the intermediate convolution modules of the dual-stream neural network model is calculated through a convolution kernel.
- the dual-stream neural network model further includes a fully connected layer, a first loss function and a second loss function.
- the features output by the convolution module Conv5_x are used as the input of one fully connected layer, and the output features of the routing module of the last layer are used as the input of another fully connected layer;
- the results of the two fully connected layers are added as the output of the total fully connected layer, and the first prediction result and the second prediction result are obtained.
- the routing module includes: a first convolution unit, a first normalization unit, a first activation unit, a second convolution unit, a second normalization unit, and a second activation unit;
- through the first convolution unit, the first normalization unit, the first activation unit, the second convolution unit, the second normalization unit, and the second activation unit in turn, interactive learning is performed on the feature matrix output by the convolution calculation module of the first neural network model and the feature matrix output by the convolution calculation module of the second neural network model, to obtain the feature matrix output by the routing module.
- the routing module includes two layers of convolution units, two layers of normalization units, and two layers of activation units; in order, these can be the first convolution unit (Conv1D), the first normalization unit (Batch Normalization), the first activation unit (ReLU), the second convolution unit (Conv1D), the second normalization unit (Batch Normalization), and the second activation unit (ReLU).
- the outputs of the two-way convolution modules of each layer of the intermediate convolution modules of the dual-stream neural network model are used as the input of the corresponding routing module, and the output of each layer of routing module is used as the input of the next layer of convolution modules or of the fully connected layer.
- the routing module can be a 1*1 convolution-based computing unit; the outputs of the two-way convolution modules of the previous layer are passed, after learning and redirection by the 1*1 convolution unit, to the convolution modules of the subsequent layer;
- the outputs of the two-way convolution modules can be the information flows of multi-modal image features, such as the information flow of RGB-format features and the information flow of depth image features.
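- a minimal PyTorch sketch of such a routing module is given below under these assumptions: the two incoming feature maps are concatenated along the channel dimension, and two 1x1 convolution / batch-normalization / ReLU stages map them back to the per-stream channel count; the unit widths and the use of 2D rather than 1D convolutions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RoutingModule(nn.Module):
    """Routing module sketch: two conv/BN/ReLU stages applied to the
    concatenated feature maps of the two streams (channel sizes assumed)."""
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),  # first convolution unit (1x1)
            nn.BatchNorm2d(channels),                          # first normalization unit
            nn.ReLU(inplace=True),                             # first activation unit
            nn.Conv2d(channels, channels, kernel_size=1),      # second convolution unit (1x1)
            nn.BatchNorm2d(channels),                          # second normalization unit
            nn.ReLU(inplace=True),                             # second activation unit
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # Interactive learning over the feature matrices output by the two streams.
        return self.block(torch.cat([feat_a, feat_b], dim=1))
```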
- the first neural network model includes a first loss function
- the second neural network model includes a second loss function
- the first neural network model, the second neural network model, and the routing module are trained through video sample data, and the parameters of the first neural network model, the parameters of the second neural network model, and the parameters of the routing module are adjusted according to the first loss function and the second loss function respectively; if the first loss function and the second loss function meet a preset threshold, the training of the parameters of the first neural network model, the second neural network model, and the routing module is stopped, and the trained dual-stream neural network model is obtained.
- the dual-stream neural network model is optimized and trained with dual loss functions: according to the output result of the fully connected layer of the convolutional neural network of the first channel, the parameters of the convolutional neural network of the first channel are trained and adjusted through the first loss function; according to the output result of the fully connected layer of the convolutional neural network of the second channel, the parameters of the convolutional neural network of the second channel are trained and adjusted through the second loss function; at the same time, the parameters of the routing module are trained and adjusted through both the first loss function and the second loss function.
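- a hedged sketch of this dual-loss training scheme follows; the cross-entropy form of the two losses, the optimizer, and the names dual_stream_model and loader are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

def train_one_epoch(dual_stream_model: nn.Module, loader, optimizer: torch.optim.Optimizer) -> None:
    """One epoch of dual-loss training (illustrative sketch).

    dual_stream_model is assumed to take (rgb_motion_map, depth_motion_map) and
    return the two prediction results; loader is assumed to yield batches of
    (rgb_motion_map, depth_motion_map, label).
    """
    criterion_a = nn.CrossEntropyLoss()  # first loss function (assumed form)
    criterion_b = nn.CrossEntropyLoss()  # second loss function (assumed form)
    for rgb_map, depth_map, label in loader:
        pred_a, pred_b = dual_stream_model(rgb_map, depth_map)
        loss_a = criterion_a(pred_a, label)  # drives the first-stream parameters
        loss_b = criterion_b(pred_b, label)  # drives the second-stream parameters
        optimizer.zero_grad()
        # The routing modules sit on both computation paths, so their
        # parameters receive gradients from both losses.
        (loss_a + loss_b).backward()
        optimizer.step()
```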
- Step S104: determining the classification result of the action to be recognized based on the first prediction result and the second prediction result.
- the first prediction result and the second prediction result are multimodal dual-stream high-level features output by the trained neural network model.
- Feature fusion is performed on the dual-stream high-level features to obtain the final output result in the network architecture of the dual-stream neural network model.
- the final output result is a one-dimensional score vector (probability), and the final classification result is determined according to the highest probability in the score vector; that is, the category corresponding to the highest score is the classification result of the action to be recognized.
- determining the classification result of the action to be recognized includes:
- Feature fusion is performed on the first prediction result and the second prediction result to obtain the probability distribution of action categories
- feature fusion is a calculation process in the network architecture of the dual-stream neural network model, that is, after the dual-stream neural network model obtains the feature information of the RGB-format information flow and the depth information flow, the two are fused, probability mapping is performed after the fusion, and finally the category judgment is made.
- for example, the final output result is a one-dimensional score vector (probability);
- assume the score vector is a one-dimensional vector containing 10 elements, each element is a probability between 0 and 1, and the sum of the 10 elements is 1; if the second element is the maximum value, for example 0.3, the classification result of the action to be recognized is determined to be the second category.
- the feature fusion process can perform the fusion calculation by taking the point-wise multiplication, weighted addition, or maximum of the two matrices finally output by the network architecture to obtain the final probability distribution, and the type of the action to be recognized is determined according to the category corresponding to the maximum value in the probability distribution.
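- an illustrative sketch of such a late-fusion step is given below; weighted addition is used here, while point-wise multiplication or element-wise maximum would be one-line changes, and the weights and the number of classes are assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_and_classify(pred_a: torch.Tensor, pred_b: torch.Tensor,
                      w_a: float = 0.5, w_b: float = 0.5) -> torch.Tensor:
    """Fuse the two streams' prediction scores and pick the action category.

    pred_a, pred_b: score tensors of shape (batch, num_classes).
    Returns the index of the most probable category for each sample.
    """
    # Weighted addition of the two score vectors (alternatives: pred_a * pred_b
    # for point-wise multiplication, or torch.maximum(pred_a, pred_b)).
    fused = w_a * pred_a + w_b * pred_b
    probs = F.softmax(fused, dim=1)  # map the fused scores to a probability distribution
    return probs.argmax(dim=1)       # category with the highest probability

# Example: a batch of one sample with 10 action categories.
pa = torch.randn(1, 10)
pb = torch.randn(1, 10)
print(fuse_and_classify(pa, pb))  # e.g. tensor([3])
```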
- to sum up, the terminal device can obtain the video data of the action to be recognized, the video data including the first video sequence and the second video sequence, and the first video sequence and the second video sequence are respectively compressed to obtain the first motion map and the second motion map, which provides a richer spatiotemporal representation of the video data and makes the information representation more complete and the features richer; the first motion map and the second motion map are then used as the input of the dual-stream neural network model, and the interactive learning of the multi-modal image features by the neural network model improves the accuracy of action recognition.
- FIG. 6 shows a structural block diagram of the motion recognition apparatus provided by the embodiments of the present application. For convenience of description, only the parts related to the embodiments of the present application are shown.
- the device includes:
- an acquisition unit 61 configured to acquire video data of the action to be identified, the video data including a first video sequence and a second video sequence;
- a processing unit 62, configured to perform compression processing on the first video sequence and the second video sequence respectively to obtain a first motion map corresponding to the first video sequence and a second motion map corresponding to the second video sequence;
- a computing unit 63, configured to input the first motion map and the second motion map into the trained dual-stream neural network model, and to perform interactive learning on the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model, to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence;
- An output unit 64 configured to determine a classification result of the to-be-recognized action based on the first prediction result and the second prediction result.
- with the above apparatus, the terminal device can obtain the video data of the action to be recognized, the video data including the first video sequence and the second video sequence, and the first video sequence and the second video sequence are respectively compressed to obtain the first motion map and the second motion map, which provides a richer spatiotemporal representation of the video data and makes the information representation more complete and the features richer; the first motion map and the second motion map are then used as the input of the dual-stream neural network model, and the interactive learning of the multi-modal image features by the neural network model improves the accuracy of action recognition.
- FIG. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
- the terminal device 7 in this embodiment includes: at least one processor 70 (only one is shown in FIG. 7), a memory 71, and a computer program 72 stored in the memory 71 and executable on the at least one processor 70; when the processor 70 executes the computer program 72, the steps in any of the foregoing action recognition method embodiments are implemented.
- the terminal device 7 may include, but is not limited to, a processor 70 and a memory 71 .
- FIG. 7 is only an example of the terminal device 7 and does not constitute a limitation on the terminal device 7; the terminal device may include more or fewer components than shown, or combine some components, or have different components; for example, it may also include input and output devices, network access devices, and the like.
- the so-called processor 70 may be a central processing unit (CPU), and the processor 70 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
- a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
- the memory 71 may be an internal storage unit of the terminal device 7 in some embodiments, such as a hard disk or a memory of the terminal device 7 . In other embodiments, the memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk equipped on the terminal device 7, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc. Further, the memory 71 may also include both an internal storage unit of the terminal device 7 and an external storage device. The memory 71 is used to store operating systems, application programs, bootloaders (BootLoader), data, and other programs, such as program codes of the computer programs, and the like. The memory 71 may also be used to temporarily store data that has been output or will be output.
- Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.
- embodiments of the present application further provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to implement the steps in the foregoing method embodiments.
- the integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
- all or part of the processes in the methods of the above embodiments of the present application can be implemented by instructing the relevant hardware through a computer program, and the computer program can be stored in a computer-readable storage medium.
- the computer program includes computer program code, and the computer program code may be in the form of source code, object code, an executable file, some intermediate form, or the like.
- the computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some cases, the computer-readable medium may not include electrical carrier signals and telecommunication signals.
- the disclosed apparatus/network device and method may be implemented in other manners.
- the apparatus/network device embodiments described above are only illustrative.
- the division of the modules or units is only a logical functional division; in actual implementation there may be other division methods, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
- the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, or as indirect coupling or communication connection between devices or units, and may be in electrical, mechanical, or other forms.
- the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Biophysics (AREA)
- Signal Processing (AREA)
- Image Analysis (AREA)
Abstract
Description
Claims (10)
- 1. An action recognition method based on feature interactive learning, characterized in that it comprises: acquiring video data of an action to be recognized, the video data comprising a first video sequence and a second video sequence; compressing the first video sequence and the second video sequence respectively, to obtain a first motion map corresponding to the first video sequence and a second motion map corresponding to the second video sequence; inputting the first motion map and the second motion map into a trained dual-stream neural network model, and performing interactive learning on the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model, to obtain a first prediction result of the first video sequence and a second prediction result of the second video sequence output by the trained dual-stream neural network model; and determining a classification result of the action to be recognized based on the first prediction result and the second prediction result.
- 2. The method according to claim 1, wherein compressing the first video sequence to obtain the first motion map corresponding to the first video sequence comprises: acquiring a feature matrix of each video frame in the first video sequence; and performing compression calculation on the feature matrix of each video frame according to the time sequence of the video frames in the first video sequence, to obtain a feature matrix used to represent the first motion map.
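As a minimal sketch of the compression calculation in the claim above: the claim only requires that the per-frame feature matrices be compressed according to the frames' time sequence, so the linearly increasing temporal weights below are an assumption made purely for illustration.

```python
import numpy as np

def compress_frames(frame_features):
    """frame_features: list of T per-frame feature matrices, e.g. arrays of
    shape (H, W) or (H, W, C), ordered by the frames' time sequence.
    Returns a single matrix of the same spatial shape representing the motion map."""
    frames = np.stack(frame_features, axis=0).astype(np.float32)  # (T, ...)
    t = frames.shape[0]
    weights = np.arange(1, t + 1, dtype=np.float32)  # later frames weighted more (assumption)
    weights /= weights.sum()                         # normalize so the output keeps the input scale
    # Weighted sum over the time axis, preserving the temporal order of the frames.
    return np.tensordot(weights, frames, axes=(0, 0))
```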
- 3. The method according to claim 1, wherein compressing the second video sequence to obtain the second motion map corresponding to the second video sequence comprises: performing grayscale processing on the second video sequence to obtain grayscale sequence frames corresponding to the second video sequence; and performing compression calculation on the feature matrices of the grayscale sequence frames according to the time sequence of the video frames in the second video sequence, to obtain a feature matrix used to represent the second motion map.
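For the second stream, the claim above adds a grayscale step before the same kind of temporal compression. The sketch below assumes standard BT.601 luminance coefficients and reuses the `compress_frames` helper from the previous sketch; both choices are assumptions, not part of the claim.

```python
import numpy as np

def to_grayscale(frame_rgb):
    """frame_rgb: (H, W, 3) array; returns an (H, W) grayscale frame using
    BT.601 luminance weights (one common choice, assumed here)."""
    coeffs = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return frame_rgb.astype(np.float32) @ coeffs

def second_motion_map(frames_rgb):
    """Grayscale the sequence frames, then compress them along time."""
    gray_frames = [to_grayscale(f) for f in frames_rgb]
    return compress_frames(gray_frames)  # same temporal compression as the first stream
```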
- 4. The method according to claim 1, wherein the trained dual-stream neural network model comprises a first neural network model, a second neural network model, and a routing module, the routing module being disposed between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model; the input of the first neural network model is the first motion map, and its output is the first prediction result of the first video sequence; the input of the second neural network model is the second motion map, and its output is the second prediction result of the second video sequence; and the routing module is configured to perform interactive learning on the features of the first motion map and the features of the second motion map between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model.
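A possible PyTorch skeleton of the arrangement in the claim above. The backbone blocks, channel widths, and input channel counts (3 for the first motion map, 1 for the grayscale-based second one) are assumptions; `SimpleRouter` is only a placeholder, since the routing module's internals are sketched after the corresponding claim below, and the per-layer interaction is unrolled in the sketch after the next claim.

```python
import torch
import torch.nn as nn

class SimpleRouter(nn.Module):
    """Stand-in for the routing module; its Conv -> BN -> ReLU internals are
    sketched separately after the routing-module claim below."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_a, feat_b):
        return self.proj(torch.cat([feat_a, feat_b], dim=1))

class DualStreamNet(nn.Module):
    """Two parallel CNN streams whose intermediate convolution modules
    interact through routing modules; each stream keeps its own head."""
    def __init__(self, num_classes, in_ch=(3, 1), widths=(32, 64, 128)):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        dims1, dims2 = (in_ch[0],) + widths, (in_ch[1],) + widths
        self.stream1 = nn.ModuleList([block(a, b) for a, b in zip(dims1, dims1[1:])])
        self.stream2 = nn.ModuleList([block(a, b) for a, b in zip(dims2, dims2[1:])])
        self.routers = nn.ModuleList([SimpleRouter(c) for c in widths])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head1 = nn.Linear(widths[-1], num_classes)
        self.head2 = nn.Linear(widths[-1], num_classes)

    def forward(self, motion_map_1, motion_map_2):
        x1, x2 = motion_map_1, motion_map_2
        for conv1, conv2, route in zip(self.stream1, self.stream2, self.routers):
            x1, x2 = conv1(x1), conv2(x2)
            r = route(x1, x2)           # interactive learning between the two streams
            x1, x2 = x1 + r, x2 + r     # routed features are fed forward in both streams
        pred_1 = self.head1(self.pool(x1).flatten(1))   # first prediction result
        pred_2 = self.head2(self.pool(x2).flatten(1))   # second prediction result
        return pred_1, pred_2
```

Under these assumptions, `DualStreamNet(num_classes=10)` takes the two motion maps and returns the two per-stream predictions that are later fused into a classification result.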
- 5. The method according to claim 4, wherein the intermediate convolution module of the first neural network model comprises first convolution modules of a preset number of layers, and the intermediate convolution module of the second neural network model comprises second convolution modules corresponding to the first convolution modules; and inputting the first motion map and the second motion map into the trained dual-stream neural network model, and performing interactive learning on the features of the first motion map and the features of the second motion map through the trained dual-stream neural network model to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence, comprises: taking the output of the first convolution module of the first layer and the output of the second convolution module of the first layer as the input of the routing module of the first layer, and performing feature interactive learning by the routing module of the first layer to obtain a first routing output; taking the superposition of the output of the first convolution module of the first layer and the first routing output as the input of the first convolution module of the second layer, and performing feature learning by the first convolution module of the second layer to obtain the output of the first convolution module of the second layer; taking the superposition of the output of the second convolution module of the first layer and the first routing output as the input of the second convolution module of the second layer, and performing feature learning by the second convolution module of the second layer to obtain the output of the second convolution module of the second layer; and taking the output of the first convolution module of the second layer and the output of the second convolution module of the second layer as the input of the routing module of the second layer, and performing feature interactive learning by the routing module of the second layer to obtain a second routing output; wherein the first convolution module of the first layer and the first convolution module of the second layer are two adjacent convolution layers in the intermediate convolution module of the first neural network model, the second convolution module of the first layer and the second convolution module of the second layer are two adjacent convolution layers in the intermediate convolution module of the second neural network model, and the routing module of the first layer and the routing module of the second layer are two adjacent calculation modules.
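To map the layer-wise wording of the claim above onto code, the sketch below unrolls two adjacent stages explicitly; `conv1_a` and `conv2_a` (the first convolution modules of layers 1 and 2), `conv1_b` and `conv2_b` (the corresponding second convolution modules), and `router1` and `router2` are hypothetical module instances passed in by the caller.

```python
def two_layer_interaction(x_a, x_b, conv1_a, conv1_b, conv2_a, conv2_b, router1, router2):
    """Sketch of two adjacent layers with routing, following the claim above."""
    out1_a = conv1_a(x_a)                # output of the first conv module of layer 1
    out1_b = conv1_b(x_b)                # output of the second conv module of layer 1
    route1 = router1(out1_a, out1_b)     # first routing output (feature interaction)

    out2_a = conv2_a(out1_a + route1)    # superposition fed to the layer-2 first conv module
    out2_b = conv2_b(out1_b + route1)    # superposition fed to the layer-2 second conv module
    route2 = router2(out2_a, out2_b)     # second routing output
    return out2_a, out2_b, route2
```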
- 6. The method according to claim 4, wherein the routing module comprises a first convolution unit, a first normalization unit, a first activation unit, a second convolution unit, a second normalization unit, and a second activation unit; and interactive learning is performed in sequence, through the first convolution unit, the first normalization unit, the first activation unit, the second convolution unit, the second normalization unit, and the second activation unit of the routing module, on the feature matrix output by the convolution calculation module of the first neural network model and the feature matrix output by the convolution calculation module of the second neural network model, so as to obtain the feature matrix output by the routing module.
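A sketch of the routing module described in the claim above. The claim fixes the Conv -> BN -> ReLU -> Conv -> BN -> ReLU structure but not how the two incoming feature matrices are combined, so channel-wise concatenation is used here as an assumption.

```python
import torch
import torch.nn as nn

class RoutingModule(nn.Module):
    """Conv -> BN -> ReLU -> Conv -> BN -> ReLU applied to the two streams'
    feature matrices; concatenating the inputs along channels is an assumption."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),  # first convolution unit
            nn.BatchNorm2d(channels),                                     # first normalization unit
            nn.ReLU(inplace=True),                                        # first activation unit
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),      # second convolution unit
            nn.BatchNorm2d(channels),                                     # second normalization unit
            nn.ReLU(inplace=True),                                        # second activation unit
        )

    def forward(self, feat_a, feat_b):
        # The routed feature matrix keeps the same shape as each stream's features.
        return self.body(torch.cat([feat_a, feat_b], dim=1))
```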
- 7. The method according to any one of claims 1 to 6, wherein determining the classification result of the action to be recognized based on the first prediction result and the second prediction result comprises: performing feature fusion on the first prediction result and the second prediction result to obtain a probability distribution of action categories; and taking the action category with the highest probability in the probability distribution as the classification result of the action to be recognized.
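For the fusion step in the claim above, averaging the two streams' softmax distributions is one simple fusion choice; the exact fusion operator is not specified in the claim and is assumed here.

```python
import torch

def classify(pred_1, pred_2, class_names):
    """pred_1, pred_2: logits of shape (1, num_classes) from the two streams."""
    probs = (torch.softmax(pred_1, dim=1) + torch.softmax(pred_2, dim=1)) / 2  # feature fusion
    return class_names[int(probs.argmax(dim=1))]  # action category with the highest probability
```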
- 8. The method according to claim 4, wherein the first neural network model comprises a first loss function, and the second neural network model comprises a second loss function; the first neural network model, the second neural network model, and the routing module are trained with sample video data, and the parameters of the first neural network model, the parameters of the second neural network model, and the parameters of the routing module are adjusted according to the first loss function and the second loss function, respectively; and if the first loss function and the second loss function satisfy a preset threshold, the training of the parameters of the first neural network model, the parameters of the second neural network model, and the routing module is stopped, and the trained dual-stream neural network model is obtained.
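A minimal training sketch for the claim above. Using cross-entropy for both loss functions, a single optimizer over all parameters (both streams and the routing modules together), and checking the last batch's losses against the threshold are all assumptions made to keep the sketch short.

```python
import torch
import torch.nn as nn

def train_dual_stream(model, loader, epochs=50, loss_threshold=0.05, lr=1e-3):
    """`loader` is assumed to yield (motion_map_1, motion_map_2, label) batches
    built from sample video data; `model` is the dual-stream network, whose
    parameters include those of the routing modules."""
    criterion_1 = nn.CrossEntropyLoss()  # first loss function (first stream)
    criterion_2 = nn.CrossEntropyLoss()  # second loss function (second stream)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for _ in range(epochs):
        for m1, m2, label in loader:
            pred_1, pred_2 = model(m1, m2)
            loss_1 = criterion_1(pred_1, label)
            loss_2 = criterion_2(pred_2, label)
            optimizer.zero_grad()
            (loss_1 + loss_2).backward()  # both losses adjust both streams and the routers
            optimizer.step()
        # Stop once both losses satisfy the preset threshold.
        if loss_1.item() < loss_threshold and loss_2.item() < loss_threshold:
            break
    return model
```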
- 9. A terminal device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 8 when executing the computer program.
- 10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 8.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011078182.6A CN112257526B (en) | 2020-10-10 | 2020-10-10 | Action recognition method based on feature interactive learning and terminal equipment |
CN202011078182.6 | 2020-10-10 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022073282A1 true WO2022073282A1 (en) | 2022-04-14 |
Family
ID=74241911
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/129550 WO2022073282A1 (en) | 2020-10-10 | 2020-11-17 | Motion recognition method based on feature interactive learning, and terminal device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112257526B (en) |
WO (1) | WO2022073282A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114898461A (en) * | 2022-04-28 | 2022-08-12 | 河海大学 | Human body behavior identification method based on double-current non-local space-time convolution neural network |
CN115174995A (en) * | 2022-07-04 | 2022-10-11 | 北京国盛华兴科技有限公司 | Frame insertion method and device for video data |
CN116226661A (en) * | 2023-01-04 | 2023-06-06 | 浙江大邦科技有限公司 | Device and method for monitoring equipment state operation |
CN116434335A (en) * | 2023-03-30 | 2023-07-14 | 东莞理工学院 | Method, device, equipment and storage medium for identifying action sequence and deducing intention |
CN117556381A (en) * | 2024-01-04 | 2024-02-13 | 华中师范大学 | Knowledge level depth mining method and system for cross-disciplinary subjective test questions |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115666387A (en) * | 2021-03-19 | 2023-01-31 | 京东方科技集团股份有限公司 | Electrocardiosignal identification method and electrocardiosignal identification device based on multiple leads |
CN113326835B (en) * | 2021-08-04 | 2021-10-29 | 中国科学院深圳先进技术研究院 | Action detection method and device, terminal equipment and storage medium |
CN115100740B (en) * | 2022-06-15 | 2024-04-05 | 东莞理工学院 | Human motion recognition and intention understanding method, terminal equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110020658A (en) * | 2019-03-28 | 2019-07-16 | 大连理工大学 | A kind of well-marked target detection method based on multitask deep learning |
CN110175596A (en) * | 2019-06-04 | 2019-08-27 | 重庆邮电大学 | The micro- Expression Recognition of collaborative virtual learning environment and exchange method based on double-current convolutional neural networks |
CN110633630A (en) * | 2019-08-05 | 2019-12-31 | 中国科学院深圳先进技术研究院 | Behavior identification method and device and terminal equipment |
CN111199238A (en) * | 2018-11-16 | 2020-05-26 | 顺丰科技有限公司 | Behavior identification method and equipment based on double-current convolutional neural network |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220616B (en) * | 2017-05-25 | 2021-01-19 | 北京大学 | Adaptive weight-based double-path collaborative learning video classification method |
CN107862376A (en) * | 2017-10-30 | 2018-03-30 | 中山大学 | A kind of human body image action identification method based on double-current neutral net |
CN107808150A (en) * | 2017-11-20 | 2018-03-16 | 珠海习悦信息技术有限公司 | The recognition methods of human body video actions, device, storage medium and processor |
US11600387B2 (en) * | 2018-05-18 | 2023-03-07 | Htc Corporation | Control method and reinforcement learning for medical system |
CN110555340B (en) * | 2018-05-31 | 2022-10-18 | 赛灵思电子科技(北京)有限公司 | Neural network computing method and system and corresponding dual neural network implementation |
CN110287820B (en) * | 2019-06-06 | 2021-07-23 | 北京清微智能科技有限公司 | Behavior recognition method, device, equipment and medium based on LRCN network |
CN111027377B (en) * | 2019-10-30 | 2021-06-04 | 杭州电子科技大学 | Double-flow neural network time sequence action positioning method |
- 2020
- 2020-10-10 CN CN202011078182.6A patent/CN112257526B/en active Active
- 2020-11-17 WO PCT/CN2020/129550 patent/WO2022073282A1/en active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111199238A (en) * | 2018-11-16 | 2020-05-26 | 顺丰科技有限公司 | Behavior identification method and equipment based on double-current convolutional neural network |
CN110020658A (en) * | 2019-03-28 | 2019-07-16 | 大连理工大学 | A kind of well-marked target detection method based on multitask deep learning |
CN110175596A (en) * | 2019-06-04 | 2019-08-27 | 重庆邮电大学 | The micro- Expression Recognition of collaborative virtual learning environment and exchange method based on double-current convolutional neural networks |
CN110633630A (en) * | 2019-08-05 | 2019-12-31 | 中国科学院深圳先进技术研究院 | Behavior identification method and device and terminal equipment |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114898461A (en) * | 2022-04-28 | 2022-08-12 | 河海大学 | Human body behavior identification method based on double-current non-local space-time convolution neural network |
CN115174995A (en) * | 2022-07-04 | 2022-10-11 | 北京国盛华兴科技有限公司 | Frame insertion method and device for video data |
CN116226661A (en) * | 2023-01-04 | 2023-06-06 | 浙江大邦科技有限公司 | Device and method for monitoring equipment state operation |
CN116434335A (en) * | 2023-03-30 | 2023-07-14 | 东莞理工学院 | Method, device, equipment and storage medium for identifying action sequence and deducing intention |
CN116434335B (en) * | 2023-03-30 | 2024-04-30 | 东莞理工学院 | Method, device, equipment and storage medium for identifying action sequence and deducing intention |
CN117556381A (en) * | 2024-01-04 | 2024-02-13 | 华中师范大学 | Knowledge level depth mining method and system for cross-disciplinary subjective test questions |
CN117556381B (en) * | 2024-01-04 | 2024-04-02 | 华中师范大学 | Knowledge level depth mining method and system for cross-disciplinary subjective test questions |
Also Published As
Publication number | Publication date |
---|---|
CN112257526A (en) | 2021-01-22 |
CN112257526B (en) | 2023-06-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022073282A1 (en) | Motion recognition method based on feature interactive learning, and terminal device | |
US11551333B2 (en) | Image reconstruction method and device | |
CN110532871B (en) | Image processing method and device | |
CN109359592B (en) | Video frame processing method and device, electronic equipment and storage medium | |
WO2021043168A1 (en) | Person re-identification network training method and person re-identification method and apparatus | |
WO2020192483A1 (en) | Image display method and device | |
KR20230013243A (en) | Maintain a fixed size for the target object in the frame | |
CN113066017B (en) | Image enhancement method, model training method and equipment | |
CN110222717B (en) | Image processing method and device | |
CN108388882B (en) | Gesture recognition method based on global-local RGB-D multi-mode | |
CN111310676A (en) | Video motion recognition method based on CNN-LSTM and attention | |
WO2020082382A1 (en) | Method and system of neural network object recognition for image processing | |
CN110070107A (en) | Object identification method and device | |
WO2021073311A1 (en) | Image recognition method and apparatus, computer-readable storage medium and chip | |
CN111079507B (en) | Behavior recognition method and device, computer device and readable storage medium | |
WO2021047587A1 (en) | Gesture recognition method, electronic device, computer-readable storage medium, and chip | |
WO2022104026A1 (en) | Consistency measure for image segmentation processes | |
WO2020092276A1 (en) | Video recognition using multiple modalities | |
CN110222718A (en) | The method and device of image procossing | |
WO2022205329A1 (en) | Object detection method, object detection apparatus, and object detection system | |
KR101344851B1 (en) | Device and Method for Processing Image | |
CN110633630B (en) | Behavior identification method and device and terminal equipment | |
CN113489958A (en) | Dynamic gesture recognition method and system based on video coding data multi-feature fusion | |
WO2021189321A1 (en) | Image processing method and device | |
CN116824641A (en) | Gesture classification method, device, equipment and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20956597 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20956597 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14-12-2023) |
|