WO2022073282A1

WO2022073282A1 - Motion recognition method based on feature interactive learning, and terminal device

Info

Publication number: WO2022073282A1
Application number: PCT/CN2020/129550
Authority: WO
Inventors: 任子良; 程俊; 张锲石; 高向阳; 康宇航
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2020-10-10
Filing date: 2020-11-17
Publication date: 2022-04-14
Also published as: CN112257526A; CN112257526B

Abstract

A motion recognition method based on feature interactive learning, and a terminal device, which are applicable to the technical field of computer vision. The method comprises: acquiring video data of a motion to be recognized, the video data comprising a first video sequence and a second video sequence (S101); respectively performing compression processing on the first video sequence and the second video sequence, so as to obtain a first motion graph and a second motion graph (S102); inputting the first motion graph and the second motion graph into a trained two-stream neural network model, and performing interactive learning on features of the first motion graph and features of the second motion graph by means of the trained two-stream neural network model, so as to obtain a first prediction result of the first video sequence and a second prediction result of the second video sequence (S103); and on the basis of the first prediction result and the second prediction result, determining a classification result of the motion to be recognized (S104). According to the method, the problem of low motion recognition accuracy of sparse sampling is solved. By means of interactive learning of multi-modal input features, the accuracy of motion category recognition is improved.

Description

An action recognition method and terminal device based on feature interactive learning

technical field

The present application belongs to the technical field of computer vision, and in particular, relates to an action recognition method and terminal device based on feature interaction learning.

Background technique

In recent years, human action recognition has become one of the research hotspots in the field of computer vision. Through action recognition technology, computers can automatically understand and describe human actions in videos, which has great application value in many fields, such as video surveillance, human-computer interaction, motion analysis, content-based video retrieval, and autonomous driving. The methods of human action recognition mainly include methods based on artificially designed features and methods based on neural network deep learning features.

Compared with the traditional methods based on artificially designed features, the methods based on neural network deep learning features have achieved certain success in the recognition of human actions. However, in the current human action recognition method based on neural network deep learning, when processing the action classification and recognition of long video sequences, a certain number of video frames are obtained through sparse sampling as the input of the neural network, and the video frames are extracted layer by layer through the neural network. features, to identify and classify human actions; due to the complexity and variability of video shooting angles, shooting dimensions, shooting backgrounds, and the differences and similarities of actions, the sparse sampling method for a single modality has a low accuracy rate of action recognition. .

SUMMARY OF THE INVENTION

The embodiments of the present application provide an action recognition method and terminal device based on feature interaction learning, which can solve the problem of low accuracy of action recognition due to the sparse sampling method of a single modality.

In a first aspect, an embodiment of the present application provides an action recognition method based on feature interaction learning, the method includes: acquiring video data of an action to be recognized, the video data including a first video sequence and a second video sequence; Perform compression processing on the first video sequence and the second video sequence respectively to obtain a first motion picture corresponding to the first video sequence and a second motion picture corresponding to the second video sequence; The motion map and the second motion map are input into the trained dual-stream neural network model, and the features of the first motion map and the features of the second motion map are interactively learned through the trained dual-stream neural network model, Obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence output by the trained dual-stream neural network model; based on the first prediction result and the second prediction result , and determine the classification result of the action to be recognized.

In a possible implementation manner of the first aspect, performing compression processing on the first video sequence to obtain a first motion picture corresponding to the first video sequence, including:

Obtain the feature matrix of each video frame in the first video sequence; according to the time sequence of the video frames in the first video sequence, compress and calculate the feature matrix of each video frame, and obtain the feature matrix used to represent the first video frame. A feature matrix of a motion map.

In a possible implementation manner of the first aspect, performing compression processing on the second video sequence to obtain a second motion image corresponding to the second video sequence includes:

Perform grayscale processing on the second video sequence to obtain a grayscale sequence frame corresponding to the second video sequence; Compression calculation to obtain a feature matrix for representing the second motion map.

In a possible implementation manner of the first aspect, the trained dual-stream neural network model includes a first neural network model, a second neural network model, and a routing module, and the routing module is set in the first neural network between the intermediate convolution module of the model and the intermediate convolution module of the second neural network model; the input of the first neural network model is the first motion map, and the output is the first video sequence of the the first prediction result; the input of the second neural network model is the second motion map, and the output is the second prediction result of the second video sequence; the routing module is used for Between the intermediate convolution module of the network model and the intermediate convolution module of the second neural network model, interactive learning is performed on the features of the first motion map and the features of the second motion map.

In a possible implementation manner of the first aspect, the intermediate convolution module of the first neural network model includes a first convolution module with a preset number of layers, and the intermediate convolution module of the second neural network model includes A second convolution module corresponding to the first convolution module; the first motion map and the second motion map are input into the trained dual-stream neural network model, and the trained dual-stream neural network The network model performs interactive learning on the features of the first motion map and the features of the second motion map to obtain a first prediction result of the first video sequence and a second prediction result of the second video sequence, including :

The output of the first convolution module of the first layer and the output of the second convolution module of the first layer are used as the input of the routing module of the first layer, and the feature interactive learning is performed by the routing module of the first layer, and the first layer is obtained. A route output; take the superposition result of the output of the first convolution module of the first layer and the output of the first route as the input of the first convolution module of the second layer, and the first convolution module of the second layer The convolution module performs feature learning to obtain the output of the first convolution module of the second layer; the superposition result of the output of the second convolution module of the first layer and the output of the first route is used as the second layer The input of the second convolution module of the second layer, the feature learning is performed by the second convolution module of the second layer, and the output of the second convolution module of the second layer is obtained; the first volume of the second layer is The output of the product module and the output of the second convolution module of the second layer are used as the input of the routing module of the second layer, and the routing module of the second layer performs feature interactive learning to obtain the second routing output;

Wherein, the first convolution module of the first layer and the first convolution module of the second layer are two adjacent convolution layers in the middle convolution module of the first neural network model; The second convolution module of one layer and the second convolution module of the second layer are two adjacent convolution layers in the middle convolution module of the second neural network model; The routing module and the routing module of the second layer are two adjacent computing modules.

In a possible implementation manner of the first aspect, the routing module includes: a first convolution unit, a first normalization unit, a first activation unit, a second convolution unit, a second normalization unit, second activation unit; through the first convolution unit, the first normalization unit, the first activation unit, the second convolution unit, the second normalization unit of the routing module unit and the second activation unit, in turn, perform interactive learning on the feature matrix output by the convolution calculation module of the first neural network model and the feature matrix output by the convolution calculation module of the second neural network model, and obtain the The feature matrix output by the routing module described above.

In a possible implementation manner of the first aspect, the determining a classification result of the to-be-recognized action based on the first prediction result and the second prediction result includes:

Feature fusion is performed on the first prediction result and the second prediction result to obtain a probability distribution of action categories; the action category with the highest probability in the probability distribution is used as the classification result of the action to be recognized.

In a possible implementation manner of the first aspect, the first neural network model includes a first loss function, and the second neural network model includes a second loss function; The model, the second neural network model and the routing module are trained, and the parameters of the first neural network model and the second neural network model are adjusted according to the first loss function and the second loss function, respectively. parameters and the parameters of the routing module; if the first loss function and the second loss function meet the preset threshold, stop the parameters of the first neural network model and the second neural network model. parameters and the training of the routing module to obtain the trained dual-stream neural network model.

In a second aspect, an embodiment of the present application provides an action recognition device based on feature interactive learning, including:

an acquisition unit, configured to acquire video data of an action to be recognized, the video data including a first video sequence and a second video sequence;

A processing unit, configured to compress the first video sequence and the second video sequence respectively, to obtain a first motion picture corresponding to the first video sequence and a second motion picture corresponding to the second video sequence ;

The computing unit is used for inputting the first motion map and the second motion map into a trained dual-stream neural network model, and through the trained dual-stream neural network model, the features of the first motion map and the described The features of the second motion map are interactively learned to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence;

An output unit, configured to determine a classification result of the action to be recognized based on the first prediction result and the second prediction result.

In a third aspect, an embodiment of the present application provides a terminal device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program When implementing the method described in the first aspect and possible implementation manners of the first aspect.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the first aspect and possible implementations of the first aspect are implemented method described.

In a fifth aspect, an embodiment of the present application provides a computer program product that, when the computer program product runs on a terminal device, enables the terminal device to execute the action recognition method described in any one of the first aspects above.

It can be understood that, for the beneficial effects of the second aspect to the fifth aspect, reference may be made to the relevant description of the first aspect, which will not be repeated here.

Compared with the prior art, the beneficial effects of the embodiments of the present application are: through the embodiments of the present application, the terminal device can obtain video data of actions to be recognized, and the video data includes a first video sequence and a second video sequence; The sequence and the second video sequence are respectively compressed to obtain the first motion map corresponding to the first video sequence and the second motion map corresponding to the second video sequence; the first motion map and the second motion map are input into the trained dual-stream neural network. The network model, through the trained dual-stream neural network model, interactively learns the features of the first motion map and the features of the second motion map, and obtains the first prediction result of the first video sequence and the second prediction result of the second video sequence; Based on the first prediction result and the second prediction result, the classification result of the action to be recognized is determined; the first motion picture and the second motion picture are obtained by compressing the first video sequence and the second video sequence respectively, and the video data is enriched The spatial and temporal representation of the model makes the information representation more complete and the features richer; thus, the first motion map and the second motion map are used as the input of the dual-stream neural network model, and the multi-modal image features are interactively learned through the neural network model. The accuracy of action recognition; has strong ease of use and practicality.

Description of drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only for the present application. In some embodiments, for those of ordinary skill in the art, other drawings can also be obtained according to these drawings without any creative effort.

1 is a schematic flowchart of an action recognition method provided by an embodiment of the present application;

2 is a schematic diagram of video data compression processing provided by an embodiment of the present application;

3 is a schematic diagram of a network architecture of a dual-stream neural network model provided by an embodiment of the present application;

4 is a schematic diagram of the architecture of a routing module of a dual-stream neural network provided by an embodiment of the present application;

5 is a schematic diagram of the architecture of a middle-level feature interaction learning unit of a dual-stream neural network provided by an embodiment of the present application;

6 is a schematic structural diagram of a motion recognition device provided by an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.

Detailed ways

In the following description, for the purpose of illustration rather than limitation, specific details such as a specific system structure and technology are set forth in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It is to be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described feature, integer, step, operation, element and/or component, but does not exclude one or more other The presence or addition of features, integers, steps, operations, elements, components and/or sets thereof.

It will also be understood that, as used in this specification and the appended claims, the term "and/or" refers to and including any and all possible combinations of one or more of the associated listed items.

As used in the specification of this application and the appended claims, the term "if" may be contextually interpreted as "when" or "once" or "in response to determining" or "in response to detecting ". Similarly, the phrases "if it is determined" or "if the [described condition or event] is detected" may be interpreted, depending on the context, to mean "once it is determined" or "in response to the determination" or "once the [described condition or event] is detected. ]" or "in response to detection of the [described condition or event]".

In addition, in the description of the specification of the present application and the appended claims, the terms "first", "second", "third", etc. are only used to distinguish the description, and should not be construed as indicating or implying relative importance.

References in this specification to "one embodiment" or "some embodiments" and the like mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in other embodiments," etc. in various places in this specification are not necessarily All refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise. The terms "including", "including", "having" and their variants mean "including but not limited to" unless specifically emphasized otherwise.

At present, for computer vision tasks, models such as 2D convolutional network 2D-ConvNets, 3D convolutional network 3D-ConvNets, and recurrent neural network (RNN) based on convolutional neural networks are mainly used. The feature learning ability and action recognition effect of convolutional neural networks have achieved certain results, but there are still many challenges in the process of action classification processing and recognition for long video sequences.

Human action recognition method based on convolutional neural network, given a certain number of RGB or depth video sequences, a certain number of video frames are obtained as the input of the network through the sparse sampling method, and the convolutional neural network extracts the features in the video frames layer by layer, And the human actions are classified and recognized through a classifier or normalization (Softmax function) layer.

Among them, human action recognition methods based on convolutional neural networks can be divided into the following two categories: First, 2D end-to-end network training structure: supervised training of deep networks through large-scale labeled datasets, and then through parameter fine-tuning training Get the trained model for the actual task. For video sequences, this method mainly uses sparse sampling to obtain a certain frame of the entire video sequence as the network input, and cannot learn the temporal dimension features of human actions well. Second, the 3D end-to-end network training structure: a few frames of images are obtained through sparse sampling as the input of the network model, and the classification model is obtained through supervised training and parameter fine-tuning training. This method can obtain better recognition effect, but the huge amount of calculation restricts its application in practical scenarios.

In addition, human action recognition methods based on other deep networks, other deep networks applied to human action recognition include recurrent convolutional network (RNN) and so on. The RNN network memorizes the previous information and applies it to the calculation of the current output. It can process sequence data of any length, and realize feature learning and recognition and classification through the ordered loop learning of the input sequence, which is widely used in the field of natural language processing. It has achieved good applications, but needs to be further improved in human action recognition.

To sum up, in the existing computer vision, action recognition methods based on convolutional neural networks or other deep networks lack the representation of multi-modal spatiotemporal information of long video sequences and the mutual learning of multi-modal features, so that action recognition can be achieved. The accuracy still needs to be improved.

This application will realize the processing and recognition of user action classification based on the representation of multimodal spatiotemporal information of long video sequences and the mutual learning of multimodal features, which further improves the accuracy of user action recognition. The technical solutions of the present application will be described in detail below with reference to the drawings and specific embodiments.

Please refer to FIG. 1 , which is a schematic flowchart of a motion recognition method provided by an embodiment of the present application. The execution body of the method may be an independent terminal, such as a mobile phone, a computer, a multimedia device, a streaming media device, a monitoring device and other terminal devices; It can also be an integrated module in the terminal device, which can be implemented as a certain function in the terminal device. The following describes an example of the method being executed by a terminal device, but the embodiment of the present application is not limited to this. As shown in Figure 1, the method includes:

Step S101, acquiring video data of the action to be recognized, where the video data includes a first video sequence and a second video sequence.

In some embodiments, the video data is a sequential multi-frame image sequence combined in time, a sequence of all image frames of the entire video for which the action is to be identified. The terminal device can acquire the video data of the action to be recognized in real time through the RGB-D camera device; the video data of the action to be recognized can also be the video data pre-stored in the terminal device. The first video sequence and the second video sequence are video frame sequences of two different modalities, that is, different feature representations of the same piece of video data, for example, a color video sequence in RGB format and a video sequence represented by depth information, respectively. , video sequence or skeleton sequence in optical flow graph format, etc.

The detailed description is given by taking as an example that the first video sequence is a color video sequence in RGB format and the second video sequence is a video sequence represented by depth information. A color video sequence is a multi-frame image sequence in RGB format, that is, a color image sequence in which pixel information in each frame of image is represented by three colors of red, green, blue, and RGB; a video sequence represented by depth information is represented by depth values. The depth image sequence of the pixel point information in the frame image, the image depth of each frame image determines the possible color number or possible gray level of each pixel point of the image.

It should be noted that, by setting the shooting parameters of the shooting device, the acquired sequence frames of the first video sequence and the second video sequence can be in a one-to-one correspondence in time sequence, that is, each video frame of the first video sequence at the same moment. Corresponding to each video frame of the second video sequence, for example, set to shoot 20 frames per second, the first video sequence and the second video sequence of the same video can include the same number of video frames; thus, they can be represented by different feature quantities. Spatiotemporal information of video frames at the same time.

Among them, the spatiotemporal information includes the information of the temporal dimension and the information of the spatial dimension of the video frame sequence; the information of the temporal dimension is represented at different time points corresponding to a video frame, and the continuous video frame sequence constitutes a dynamic effect by the continuity of time; the spatial dimension The information can be expressed as texture information or color information of each video frame. For example, a video frame of a color video sequence in RGB format is represented by a 3-channel*widthW*heightH matrix, and the elements in the three channels represent the color information of each pixel in the video frame; the video represented by depth information For the video frames of the sequence, the depth information is measured by length units (such as millimeters, etc.). In order to facilitate computer processing, the distance information representing the depth and the grayscale information are converted correspondingly, and the matrix is obtained in the form of 1 channel * width W * height H Represented grayscale image.

In addition, the sequence frames of the first video sequence and the sequence frames of the second video sequence correspond one-to-one in time sequence, and are the same piece of video data. For example, by setting the shooting parameters of the shooting device, within the same shooting time, the first The video sequence includes 50 frames of images, and the second video frame also includes 50 frames of images.

In some embodiments, if the first video sequence is color video data in RGB format, it can be acquired by a camera in RGB format; if the second video sequence is a video sequence represented by depth information, it can be acquired by a depth camera; two The shooting parameters set by the various cameras may be the same, and the shooting parameters for the same target are shot in the same time period, and there is no specific limitation here.

The terminal device can be a device integrated with the camera, and the terminal device can directly obtain the video data of the action to be recognized through the camera device; the terminal device can also be a device that is separate from the camera device, and the terminal device can be wired or wirelessly connected to the camera device. The camera device is connected in communication to obtain video data of the action to be recognized. The action to be identified may be a human action or an activity action, or an animal action or action action, without any specific limitation.

In the above embodiment, the terminal device obtains the image frame sequence of the entire video data of the action to be recognized, records the multi-modal underlying features of the video data, and makes good use of the first video sequence and the second video sequence. The features of spatiotemporal information of different modalities provide a basis for the subsequent neural network model to learn various possibilities for features, and enhance the neural network model's ability to express and recognize image features.

Step S102: Compress the first video sequence and the second video sequence respectively to obtain a first motion picture corresponding to the first video sequence and a second motion picture corresponding to the second video sequence.

In some embodiments, in order to perform feature learning on the entire video data of the action to be recognized, the terminal device compresses the multi-frame images of the first video sequence and the multi-frame images of the second video sequence respectively, so as to obtain rich spatial and temporal data. A first motion map and a second motion map of the information. The feature representation of the first motion picture is different from the feature representation of the second motion picture, and is the representation of different underlying features of the same video, that is, the videos in the first video sequence and the second video sequence are represented by different image information. Image features of the frame.

The spatiotemporal information of the first motion picture includes the spatiotemporal information of all video frames of the first video sequence, and the spatiotemporal information of the second motion picture includes the spatiotemporal information of all the video frames of the second video sequence; for example, the RGB video sequence and the depth video sequence are combined. The temporal dimension information and spatial dimension information of the image are compressed and expressed as a single three-channel image and a single-channel image, respectively, showing dynamic effects and information such as color and texture.

In the actual calculation process, each video frame of the first video sequence corresponds to a feature matrix, and each video frame of the second video sequence corresponds to a feature matrix; for example, the first video sequence or the second video sequence may respectively include T frame images , the feature matrix corresponding to each frame of image is It, then the feature matrix set of the first video sequence or the feature matrix set of the second video sequence can be expressed as <I ₁ ,I ₂ ,I ₃ ,..., _I _T >, where I ₁ is the feature matrix of the first frame image in the video sequence arranged in time series, and so on, I _T is the characteristic matrix of the T-th frame image in the video sequence arranged in time series.

In some embodiments, the first video sequence and the second video sequence are respectively subjected to compression processing, and multiple frames of images of the video sequence are compressed and synthesized into an image, which contains feature information representing actions through time and space, which can be called is a motion map, so as to obtain a paired first motion map and a second motion map containing the spatiotemporal information of the entire video sequence; that is, the feature matrices of multiple frames of images are combined in one image to represent, so that all the video sequences in the video sequence can be obtained. Features of video frames.

Exemplarily, the first motion picture may be an image synthesized by compression of frames of a video sequence in RGB format, and the second motion picture may be an image synthesized by compression of a video sequence represented by depth information; the first motion picture and the second motion picture may also be It is the compressed and synthesized image of the video sequences corresponding to other modalities respectively.

In some embodiments, the first video sequence is compressed to obtain a first motion map corresponding to the first video sequence, including:

A1. Obtain the feature matrix of each video frame in the first video sequence;

A2. According to the time sequence of the video frames in the first video sequence, compress the feature matrix to obtain a feature matrix for representing the first motion image.

In some embodiments, the first video sequence includes multiple frames of images, and each frame of image corresponds to a feature matrix; if the first video sequence is color video data in RGB format, the feature matrix of each frame of the first video sequence is a matrix of 3 channels * width W * height H, where width W and height H are in pixels, and the elements in the feature matrix correspond to pixels. The value of each element in the feature matrix represents the feature of the pixel at the corresponding position, such as a color image in RGB format, and each element represents the feature value of each pixel in the three channels of red R, green G, and blue B respectively.

In some embodiments, each image frame of the first video sequence corresponds to a feature matrix, the elements at the same position in the feature matrices of all video frames are added, and then divided by the total frames of the video frames of the first video sequence number, obtain the element value at each position in the feature matrix, and round each element value, for example, 2.6 is rounded down to obtain 2, and the feature matrix of the first motion image corresponding to the first video sequence is obtained.

As shown in FIG. 2 , a schematic diagram of video data compression processing provided by an embodiment of the present application, when a video sequence is color video data in RGB format, the RGB video sequence is compressed to obtain a corresponding RGB motion picture, and a multi-frame image is The spatio-temporal information of , is synthesized into the spatio-temporal information of a motion map. The feature matrix of the motion map corresponding to the RGB video sequence may be a 3*W*H matrix. It can be calculated by the following formula:

Wherein, MI is the feature matrix of the corresponding motion picture of the first video sequence, T is the total number of frames of the first video sequence, Iτ is the feature matrix of the τth frame image in the first video sequence, and the value range of _τ is [1 , an integer of T].

In addition, the value range of an element in the feature matrix of each frame of image of the first video sequence may be an integer of [0, 255], and the value of each element in the feature matrix of the motion image MI after the compression processing of the first video sequence is the value of each element The range is also an integer in the range [0,255].

In some embodiments, the second video sequence is compressed to obtain a second motion image corresponding to the second video sequence, including:

B1, performing grayscale processing on the second video sequence to obtain a grayscale sequence frame corresponding to the second video sequence;

B2. According to the time sequence of the grayscale sequence frame, compress the feature matrix of the grayscale sequence frame to obtain a feature matrix used to represent the second motion image.

In some embodiments, the second video sequence includes multiple frames of images, and each frame of image corresponds to a feature matrix; if the second video sequence is an image sequence in which each video frame is represented by depth information, each The feature matrix of a frame image is a matrix of 1 channel * width W * height H, where width W and height H are in pixels, and elements in the feature matrix correspond to pixels. The value of each element in the feature matrix represents the feature of the pixel at the corresponding position. Since the second video sequence is an image sequence represented by depth information, the depth map of each frame in the second video sequence can be gray-scaled, and the depth information of each pixel in the depth map can be converted by mapping [0, 255], The grayscale image of the video frame is obtained, and the value of each element in the feature matrix of the grayscale image is an integer in the range of [0,255].

Exemplarily, the value of the video sequence represented by depth information may be 0 to 10000mm, while the representation range of images in computer vision is [0, 255], so the video sequence represented by depth information needs to be scaled to match the visual representation. The value range is to map and convert the video sequence represented by the depth information to a grayscale image. Among them, there are many ways of scaling. Assuming that the video sequence represented by the depth information is a 1*W*H matrix, and setting the difference between the maximum value and the minimum value of all elements as max-min, the matrix of each depth image in the video sequence is Elements are scaled and rounded. For example: assuming that the maximum depth value max-minimum depth value min=10000, and the value of a certain element is 7580, the corresponding value of this element after the operation is (7580/10000)*255=193.29, and then rounded up to get 193, that is, the The corresponding element value is 193, thus realizing the conversion to grayscale image.

In some embodiments, the compression process of the second video sequence is similar to that of the first video sequence. After grayscale processing is performed on each frame of the second video sequence, a feature matrix of the grayscale image is obtained. Add the elements at the same position of the feature matrix of the grayscale images corresponding to all video frames in , and divide by the total number of video frames of the second video sequence to obtain the element value at each position in the feature matrix. The value of each element is rounded to obtain the feature matrix of the motion map corresponding to the second video sequence.

As shown in FIG. 2 , a schematic diagram of video data compression processing provided by an embodiment of the present application, when a video sequence is a video sequence represented by depth information, the depth video sequence is subjected to grayscale processing to obtain a grayscale image corresponding to the depth video sequence , compress the grayscale image to obtain the corresponding depth motion map, and synthesize the spatiotemporal information of multiple frames of images into the spatiotemporal information of one motion image. The feature matrix of the motion map corresponding to the depth video sequence may be a 1*W*H matrix. It can be calculated by the following formula:

Among them, MJ is the feature matrix of the motion picture corresponding to the second video sequence, N is the total number of frames of the second video sequence, In is the feature matrix of the nth frame image in the second video sequence, and the value range of _n is [1 , an integer of N]. N and T may be equal, and the values of n and τ may be equal, that is, the video frames of the first video sequence and the video frames of the second video sequence correspond one-to-one in time sequence.

In addition, the value range of an element in the feature matrix of each frame of grayscale image corresponding to the second video sequence may be an integer of [0,255], and the value of each element in the feature matrix of the motion image MJ corresponding to the second video sequence The range can be an integer in the range [0,255].

It should be noted that the video frames in the first video sequence in the RGB format and the video frames in the second video sequence represented by depth information may be in one-to-one correspondence. The grayscale image sequence obtained by performing grayscale processing on the video frames in the second video sequence represented by the depth information is also in one-to-one correspondence with the video frames in the first video sequence in the RGB format.

Step S103, inputting the first motion map and the second motion map into the trained dual-stream neural network model, and performing interactive learning on the features of the first motion map and the second motion map through the trained dual-stream neural network model to obtain the first motion map. A first predictor for a video sequence and a second predictor for a second video sequence.

In some embodiments, the dual-stream neural network model is an overall model including two independent convolutional neural network models and a routing module. The dual-stream neural network model includes two inputs and two outputs. Among them, the two inputs correspond to the feature information of the two modalities of the video data respectively, and the two outputs correspond to the prediction results of the input information of the two modalities respectively.

As shown in FIG. 3 , a schematic diagram of the network architecture of the dual-stream neural network model provided in the embodiment of the present application, the dual-stream neural network model includes two independent convolutional neural network models and routing modules, and the inputs of the two convolutional neural network models are respectively are the first motion map and the second motion map; the convolutional neural network model of each channel includes multiple convolutional layers, such as the convolutional module Conv1, the convolutional module Conv2_x, the convolutional module Conv5_x and the fully connected layer, wherein the convolutional module Conv2_x and convolution module Conv5_x respectively represent a total convolution module, and a total convolution module may include a number of convolution layers or convolution calculation units. After each convolution module of the two-way convolutional neural network model, the output of the previous module is interactively learned through the routing module, and the output of the routing module is superimposed with the output of the previous convolution module as the output of the next convolution module. Input, the mid-level interaction features of different modalities in the dual-stream neural network model are learned through the routing module.

Among them, the basic network of the two-way convolutional neural network model can be a residual network (ResNet). Due to the high modularity of the residual network, each module in the residual network can be used as the basic module for the first motion map and the second. The feature information of different modalities of the motion map is used for model training and interactive learning of features. The dual-stream neural network model optimizes and trains the model through dual loss functions.

Exemplarily, the basic network model of the dual-stream neural network model can be a deep network model such as Inception, ImageNet, TSN, and dual-stream network; the parameters of the basic network model are trained and adjusted by fine-tuning; the network model can also be designed as needed to adjust the parameters. Training set adjustment. After learning the features of moving images of different modalities through the dual-stream neural network model, the joint optimization training is performed through the dual-loss function to obtain the dual-stream high-level features of the modalities corresponding to the input image features of different modalities; for example, the input modalities are For moving images in RGB format and moving images represented by depth information, dual-stream high-level features of two modalities of RGB format and depth information can be obtained.

In some embodiments, the two inputs may include multiple channel inputs; for example, if one of the inputs is an RGB motion image, the input may include three channel inputs, corresponding to the features of the red R channel of the input RGB motion image respectively. matrix, feature matrix for the green G channel, and feature matrix for the blue B channel.

In some embodiments, the trained dual-stream neural network model includes a first neural network model, a second neural network model, and a routing module, and the routing module is disposed between the intermediate convolution module of the first neural network model and the second neural network model. between intermediate convolution modules; the input of the first neural network model is the first motion map, and the output is the first prediction result of the first video sequence; the input of the second neural network model is the second motion map, and the output is the the second prediction result of the second video sequence; the routing module is used for, between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model, for each of the two-stream neural network model The output features of the layer convolution module are interactively learned.

As shown in FIG. 4 , the architecture diagram of the dual-stream neural network model provided by the embodiment of the present application, the first neural network model corresponds to one channel of input and output, and the second neural network model corresponds to another channel of input and output. The first motion picture input into the first neural network model can be an RGB motion picture; the first prediction result output by the first neural network model is the identification result corresponding to the first video sequence, and the first video sequence can be an RGB video sequence in RGB format ; RGB motion pictures are obtained by compressing RGB video sequences in RGB format. The second motion map input into the second neural network model can be a depth motion map; the second prediction result output by the second neural network model is the recognition result corresponding to the second video sequence, and the second video sequence can be the depth represented by the depth information. Video sequence; the depth motion map is obtained by compressing the depth video sequence represented by depth information.

In the middle layer of the dual-stream neural network, it includes multiple convolution modules and multiple routing modules, such as the convolution module Conv1, the convolution module Conv2_x and the convolution module Conv5_x as shown in Figure 4; the routing module is set in the two-way convolution neural network. After each convolution module of the network model, the output of the previous module is interactively learned through the routing module. The output of the routing module is superimposed with the output of the previous convolution module as the input of the next convolution module. The module learns mid-level interaction features of different modalities in a two-stream neural network model.

In some embodiments, the intermediate convolution module of the first neural network model includes a first convolution module with a preset number of layers, and the intermediate convolution module of the second neural network model includes a second convolution module corresponding to the first convolution module. Convolution module.

As shown in Figure 4, the first motion map and the second motion map are input into the trained dual-stream neural network model, and the features of the first motion map and the features of the second motion map are interactively learned through the trained dual-stream neural network model. , obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence, including:

C1. Use the output of the first convolution module of the first layer and the output of the second convolution module as the input of the routing module of the first layer, and the routing module of the first layer performs feature interactive learning to obtain the first route output;

C2. Use the superposition result of the output of the first convolution module of the first layer and the output of the first route as the input of the first convolution module of the second layer, and the first convolution of the second layer The module performs feature learning to obtain the output of the first convolution module of the second layer;

C3. Use the superposition result of the output of the second convolution module of the first layer and the output of the first route as the input of the second convolution module of the second layer, and the second convolution module of the second layer The module performs feature learning to obtain the output of the second convolution module of the second layer;

C4. Use the output of the first convolution module of the second layer and the output of the second convolution module of the second layer as the input of the routing module of the second layer, and the routing module of the second layer performs Feature interactive learning to obtain the second routing output.

In some embodiments, a convolution module includes a number of convolution layers or convolution calculation units; a convolution layer can be a set of parallel feature maps, by sliding different convolutions on the input image and performing certain At each sliding position, an element-corresponding product and sum operation is performed between the convolution kernel and the input image to project the information in the receptive field to an element in the feature map. Among them, the size of the convolution kernel is smaller than the size of the input image, and overlaps or acts on the input image in parallel. All elements in the feature map output by the convolution module of each layer in the middle of the dual-stream neural network model are obtained through a convolution The kernel is calculated.

In addition, the dual-stream neural network model further includes a fully connected layer, a first loss function and a second loss function. As shown in Figure 4, the features output by the convolution module Conv5_x are used as the input of a fully connected layer, and the output features of the routing module of the last layer are used as the input of a fully connected layer. The results of the two fully connected layers are added as The output of the total fully connected layer, the first prediction result and the second prediction result are obtained.

In some embodiments, the routing module includes: a first convolution unit, a first normalization unit, a first activation unit, a second convolution unit, a second normalization unit, and a second activation unit; The first convolution unit, the first normalization unit, the first activation unit, the second convolution unit, the second normalization unit, and the second activation unit, sequentially output the convolution calculation module of the first neural network model. The feature matrix and the feature matrix output by the convolution calculation module of the second neural network model are interactively learned to obtain the feature matrix output by the routing module.

As shown in FIG. 5 , a schematic structural diagram of a routing module provided by an embodiment of the present application is shown. The routing module includes two layers of convolution units, two layers of normalization units, and two layers of activation units; they can be the first convolution unit Conv1D, the first normalization unit Batch Normlization, the first activation unit ReLU, and the second convolution unit. Unit Conv1D, second normalization unit Batch Normlization, second activation unit ReLU. The output of the two-way convolution module of each layer of the intermediate convolution module of the dual-stream neural network model is used as the input of the corresponding routing module, and the output of each layer of the routing module is used as the input of the next layer of convolution module or the whole. Input to the connection layer. Among them, the routing module can be a 1*1 convolution-based computing unit; the output of the two-way convolution module of the previous layer is output to the convolution of the subsequent layer after learning and redirection of the 1*1 convolution module. The output of the two-way convolution module can be the information flow of multi-modal image features, such as the information flow of RGB format and the information flow of depth image features.

In some embodiments, the first neural network model includes a first loss function, and the second neural network model includes a second loss function; the first neural network model, the second neural network model and the routing module are trained through video sample data, Adjust the parameters of the first neural network model, the parameters of the second neural network model, and the parameters of the routing module respectively according to the first loss function and the second loss function; if the first loss function and the second loss function meet the preset threshold, stop The parameters of the first neural network model, the parameters of the second neural network model and the routing module are trained to obtain a trained dual-stream neural network model.

In some embodiments, the dual-stream neural network model is optimized and trained with dual loss functions. According to the output result of the fully connected layer of the convolutional neural network of the first channel, the parameters of the convolutional neural network of the first channel are trained and adjusted through the first loss function; according to the fully connected layer of the convolutional neural network of the second channel The output result is used to train and adjust the parameters of the convolutional neural network of the second channel through the second loss function; at the same time, the parameters of the routing module are trained and adjusted through the first loss function and the second loss function.

Step S104, based on the first prediction result and the second prediction result, determine the classification result of the action to be recognized.

In some embodiments, the first prediction result and the second prediction result are multimodal dual-stream high-level features output by the trained neural network model. Feature fusion is performed on the dual-stream high-level features to obtain the final output result in the network architecture of the dual-stream neural network model. The final output result is a one-dimensional score vector (probability), and the final classification result is determined according to the highest probability in the score vector; that is, the category corresponding to the highest score is the classification result of the action to be recognized.

In some embodiments, based on the first prediction result and the second prediction result, determining the classification result of the action to be recognized includes:

D1. Feature fusion is performed on the first prediction result and the second prediction result to obtain the probability distribution of action categories;

D2. Use the action category with the highest probability in the probability distribution as the classification result of the action to be identified.

In some embodiments, feature fusion is a calculation process in the network architecture of the dual-stream neural network model, that is, after the dual-stream neural network model obtains the feature information of the RGB format information flow and the depth information flow, it will be fused, and the fusion will be performed after the fusion. Probability mapping, and finally category judgment. For example, the final output result is a one-dimensional score vector (probability), the score vector is a one-dimensional vector containing 10 elements, each element is the probability of 0 to 1, and the sum of the 10 elements is 1, assuming the first The maximum value of the two elements is 0.3, and the classification result of the action to be recognized is determined to be the second category.

Among them, the process of feature fusion can perform fusion calculation by performing point multiplication, weighted addition or maximum value of the two matrices finally output by the network architecture to obtain the final probability distribution, which is determined according to the category corresponding to the maximum value in the probability distribution. The type of action to be recognized.

Through the embodiments of the present application, the terminal device can obtain the video data of the action to be recognized, the video data includes the first video sequence and the second video sequence, and the first video sequence and the second video sequence are respectively compressed to obtain the first motion picture and the second video sequence. The second motion map, which provides a richer spatiotemporal representation of the video data, makes the information representation more complete and the features richer; thus, the first motion map and the second motion map are used as the input of the dual-stream neural network model, and the multi-modality model is analyzed by the neural network model. The interactive learning of state image features improves the accuracy of action recognition.

It should be understood that the above-mentioned embodiments are only used to illustrate the technical solutions of the present application, but not to limit them; modifications to the technical solutions recorded in the foregoing embodiments, or equivalent replacements to some of the technical features therein; for example, adding a model The dimension of multi-modal video sequence is added as the input of the model, and the dual-stream neural network model is modified into a multi-channel independent convolutional neural network model and routing module to interact with the features of video sequences of multiple modalities. Learning, etc., belong to similar inventive concepts, and the essence of the corresponding technical solutions does not deviate from the spirit and scope of the technical solutions of the embodiments of the present application, and should be included within the protection scope of the present application.

It should also be understood that the size of the sequence numbers of the steps in the above-mentioned embodiments does not mean the order of execution, and the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. .

Corresponding to the motion recognition methods described in the above embodiments, FIG. 6 shows a structural block diagram of the motion recognition apparatus provided by the embodiments of the present application. For convenience of description, only the parts related to the embodiments of the present application are shown.

Referring to Figure 6, the device includes:

an acquisition unit 61, configured to acquire video data of the action to be identified, the video data including a first video sequence and a second video sequence;

A processing unit 62, configured to perform compression processing on the first video sequence and the second video sequence respectively to obtain a first motion map corresponding to the first video sequence and a second motion corresponding to the second video sequence picture;

The computing unit 63 is used for inputting the first motion map and the second motion map into the trained dual-stream neural network model, and through the trained dual-stream neural network model, the features of the first motion map and all parameters are analyzed. performing interactive learning on the features of the second motion map to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence;

An output unit 64, configured to determine a classification result of the to-be-recognized action based on the first prediction result and the second prediction result.

It should be noted that the information exchange, execution process and other contents between the above-mentioned devices/units are based on the same concept as the method embodiments of the present application. For specific functions and technical effects, please refer to the method embodiments section. It is not repeated here.

Those skilled in the art can clearly understand that, for the convenience and simplicity of description, only the division of the above-mentioned functional units and modules is used as an example. Module completion, that is, dividing the internal structure of the device into different functional units or modules to complete all or part of the functions described above. Each functional unit and module in the embodiment may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit, and the above-mentioned integrated units may adopt hardware. It can also be realized in the form of software functional units. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing from each other, and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the above-mentioned system, reference may be made to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

FIG. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in FIG. 7 , the terminal device 7 in this embodiment includes: at least one processor 70 (only one is shown in FIG. 7 ), a memory 71 , and a processor stored in the memory 71 and can be processed in the at least one processor The computer program 72 running on the processor 70, when the processor 70 executes the computer program 72, implements the steps in any of the foregoing embodiments of the method for identifying each training board.

The terminal device 7 may include, but is not limited to, a processor 70 and a memory 71 . Those skilled in the art can understand that FIG. 7 is only an example of the terminal device 7, and does not constitute a limitation on the terminal device 7, and may include more or less components than the one shown, or combine some components, or different components , for example, may also include input and output devices, network access devices, and the like.

The so-called processor 70 may be a central processing unit (Central Processing Unit, CPU), and the processor 70 may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuits) , ASIC), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The memory 71 may be an internal storage unit of the terminal device 7 in some embodiments, such as a hard disk or a memory of the terminal device 7 . In other embodiments, the memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk equipped on the terminal device 7, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc. Further, the memory 71 may also include both an internal storage unit of the terminal device 7 and an external storage device. The memory 71 is used to store operating systems, application programs, bootloaders (BootLoader), data, and other programs, such as program codes of the computer programs, and the like. The memory 71 may also be used to temporarily store data that has been output or will be output.

Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.

The embodiments of the present application provide a computer program product, when the computer program product runs on a mobile terminal, the steps in the foregoing method embodiments can be implemented when the mobile terminal executes the computer program product.

The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the present application realizes all or part of the processes in the methods of the above embodiments, which can be completed by instructing the relevant hardware through a computer program, and the computer program can be stored in a computer-readable storage medium. When executed by a processor, the steps of each of the above method embodiments can be implemented. Wherein, the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file or some intermediate form, and the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing device/terminal device, recording medium, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), electrical carrier signals, telecommunication signals, and software distribution media. For example, U disk, mobile hard disk, disk or CD, etc. In some jurisdictions, under legislation and patent practice, computer readable media may not be electrical carrier signals and telecommunications signals.

In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

In the embodiments provided in this application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are only illustrative. For example, the division of the modules or units is only a logical function division. In actual implementation, there may be other division methods, such as multiple units. Or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. On the other hand, the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, but not to limit them; although the present application has been described in detail with reference to the above-mentioned embodiments, those of ordinary skill in the art should understand that: it can still be used for the above-mentioned implementations. The technical solutions described in the examples are modified, or some technical features thereof are equivalently replaced; and these modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions in the embodiments of the application, and should be included in the within the scope of protection of this application.

Claims

An action recognition method based on feature interaction learning, characterized in that it includes:

acquiring video data of an action to be identified, the video data including a first video sequence and a second video sequence;

compressing the first video sequence and the second video sequence respectively, to obtain a first motion picture corresponding to the first video sequence and a second motion picture corresponding to the second video sequence;

The first motion map and the second motion map are input into the trained dual-stream neural network model, and the features of the first motion map and the second motion map are analyzed by the trained dual-stream neural network model. The feature is interactively learned to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence output by the trained dual-stream neural network model;

Based on the first prediction result and the second prediction result, a classification result of the action to be recognized is determined.
The method according to claim 1, wherein the compressing the first video sequence to obtain the first motion picture corresponding to the first video sequence comprises:

Obtain the feature matrix of each video frame in the first video sequence;

According to the time sequence of the video frames in the first video sequence, the feature matrix of each video frame is compressed and calculated to obtain a feature matrix for representing the first motion image.
The method of claim 1, wherein the compressing the second video sequence to obtain a second motion picture corresponding to the second video sequence comprises:

performing grayscale processing on the second video sequence to obtain a grayscale sequence frame corresponding to the second video sequence;

According to the time sequence of the video frames in the second video sequence, the feature matrix of the grayscale sequence frame is compressed and calculated to obtain the feature matrix used to represent the second motion image.
The method of claim 1, wherein the trained dual-stream neural network model comprises a first neural network model, a second neural network model and a routing module, and the routing module is set in the first neural network between the intermediate convolution module of the model and the intermediate convolution module of the second neural network model;

The input of the first neural network model is the first motion map, and the output is the first prediction result of the first video sequence;

The input of the second neural network model is the second motion map, and the output is the second prediction result of the second video sequence;

The routing module is configured to, between the intermediate convolution module of the first neural network model and the intermediate convolution module of the second neural network model, compare the features of the first motion map and the second motion Interactive learning of graph features.
The method of claim 4, wherein the intermediate convolution module of the first neural network model comprises a first convolution module with a preset number of layers, and the intermediate convolution module of the second neural network model comprises a second convolution module corresponding to the first convolution module;

The first motion map and the second motion map are input into the trained dual-stream neural network model, and the features of the first motion map and the second motion are analyzed by the trained dual-stream neural network model. The features of the graph are interactively learned to obtain the first prediction result of the first video sequence and the second prediction result of the second video sequence, including:

The output of the first convolution module of the first layer and the output of the second convolution module of the first layer are used as the input of the routing module of the first layer, and the feature interactive learning is performed by the routing module of the first layer, and the first layer is obtained. a route output;

The superposition result of the output of the first convolution module of the first layer and the output of the first route is used as the input of the first convolution module of the second layer, and the first convolution module of the second layer performs feature learning to obtain the output of the first convolution module of the second layer;

The superposition result of the output of the second convolution module of the first layer and the output of the first route is used as the input of the second convolution module of the second layer, and the second convolution module of the second layer performs feature learning to obtain the output of the second convolution module of the second layer;

The output of the first convolution module of the second layer and the output of the second convolution module of the second layer are used as the input of the routing module of the second layer, and the feature interaction is performed by the routing module of the second layer learn, get the second route output;

Wherein, the first convolution module of the first layer and the first convolution module of the second layer are two adjacent convolution layers in the middle convolution module of the first neural network model; The second convolution module of one layer and the second convolution module of the second layer are two adjacent convolution layers in the middle convolution module of the second neural network model; The routing module and the routing module of the second layer are two adjacent computing modules.
The method of claim 4, wherein the routing module comprises: a first convolution unit, a first normalization unit, a first activation unit, a second convolution unit, a second normalization unit, the second activation unit;

Through the first convolution unit, the first normalization unit, the first activation unit, the second convolution unit, the second normalization unit, the first The second activation unit performs interactive learning on the feature matrix output by the convolution calculation module of the first neural network model and the feature matrix output by the convolution calculation module of the second neural network model in turn, and obtains the output of the routing module. feature matrix.
The method according to any one of claims 1 to 6, wherein the determining the classification result of the to-be-recognized action based on the first prediction result and the second prediction result comprises:

Perform feature fusion on the first prediction result and the second prediction result to obtain a probability distribution of action categories;

The action category with the highest probability in the probability distribution is used as the classification result of the action to be identified.
The method of claim 4, wherein the first neural network model includes a first loss function, and the second neural network model includes a second loss function;

The first neural network model, the second neural network model and the routing module are trained through sample video data, and the first neural network is adjusted according to the first loss function and the second loss function respectively parameters of the model, parameters of the second neural network model, and parameters of the routing module;

If the first loss function and the second loss function meet the preset threshold, stop the training of the parameters of the first neural network model, the parameters of the second neural network model and the routing module, and obtain The trained dual-stream neural network model.
A terminal device, characterized by comprising a memory, a processor, and a computer program stored in the memory and running on the processor, the processor implementing the computer program according to claims 1 to 1 when the processor executes the computer program. 8. The method of any one.
A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the method according to any one of claims 1 to 8 is implemented.