
CN110188239B - Double-current video classification method and device based on cross-mode attention mechanism - Google Patents

Double-current video classification method and device based on cross-mode attention mechanism

Info

Publication number
CN110188239B
CN110188239B CN201910294018.XA
Authority
CN
China
Prior art keywords
rgb
optical flow
branch
cross
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910294018.XA
Other languages
Chinese (zh)
Other versions
CN110188239A (en)
Inventor
迟禄
严慧
田贵宇
穆亚东
陈刚
王成成
黄波
韩峻
糜俊青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongxing Technology Co ltd
Peking University
Nanjing University of Science and Technology
Original Assignee
Zhongxing Technology Co ltd
Peking University
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongxing Technology Co ltd, Peking University, Nanjing University of Science and Technology filed Critical Zhongxing Technology Co ltd
Publication of CN110188239A
Application granted
Publication of CN110188239B
Legal status: Active (current)
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a dual-stream video classification method and device based on a cross-modal attention mechanism. Unlike the traditional two-stream method, information from the two modalities (or even more modalities) is fused before the result is predicted, which is more efficient and makes fuller use of the data. Because this information interaction happens at an early stage, each single branch already carries the important information of the other branch by the later stages, so the accuracy of a single branch matches or even exceeds that of the traditional two-stream method while using far fewer parameters. Compared with the non-local neural network, the attention module designed by the invention works across modalities rather than applying attention within a single modality only, and when the two modalities are identical its effect is equivalent to that of the non-local neural network.

Description

Double-current video classification method and device based on cross-mode attention mechanism
Technical Field
The invention relates to a video classification method, in particular to a dual-stream video classification method and device using an attention mechanism, and belongs to the field of computer vision.
Background Art
With the rapid development of deep learning in the image domain, deep learning methods have gradually been introduced into the video domain and have achieved some success. However, current techniques still fall short of the ideal, mainly in the following two respects:
First, current techniques do not fully exploit dynamic information. Video differs from images in that the frame-to-frame dynamics are both unique to video and important for it. For example, even humans find it hard to distinguish finely classified dances (such as tango and salsa) from a single frame, yet the task becomes much easier once motion-trajectory information is added. Likewise, classifying some sports also depends on the motion trajectory.
Second, current techniques also have difficulty locating key objects quickly and accurately. Attention mechanisms have been widely used in natural language processing, but research on them in video classification is relatively scarce. With an attention mechanism, a neural network can filter out irrelevant objects and focus on the key ones. For example, if the key object "sword" is detected, classifying the corresponding classes becomes much simpler. In general, moving objects attract the human eye, and these regions often contain the key information for video classification: in "making a cake" and "making a pizza", the key object "cake" or "pizza" lies near the moving hands.
The prior art has made many attempts to solve both of the above problems. On how to use dynamic information, current techniques fall into two main categories: one designs a network structure that involves the time dimension, such as a recurrent neural network (RNN) or a three-dimensional convolutional neural network (3D-Conv), and trains it in a data-driven way to capture inter-frame information; the other uses dynamic information explicitly, that is, extracts optical flow, trains a separate neural network branch on it, and takes a weighted sum with the result of the RGB branch, which is the currently widespread two-stream video classification technique. In contrast, research on capturing key cues, i.e., introducing attention mechanisms into video classification, is relatively scarce. A representative method is the non-local neural network (Non-local Neural Networks), but it only attends to important information within a single modality and has no dedicated way of modeling "moving objects".
Disclosure of Invention
The invention provides a novel dual-stream video classification method based on a cross-modal attention mechanism that efficiently uses multi-modal information to classify videos and attends to moving objects, making video classification simpler and more efficient. The proposed technique is general and can be widely applied to existing video classification problems and other multi-modal models.
The technical problems the invention addresses are specifically: 1. making full use of multi-modal information to classify videos; 2. paying more attention to key objects so that classification is more accurate; 3. achieving higher accuracy with fewer parameters.
Unlike the traditional two-stream method, information from the two modalities (or even more modalities, such as extracted audio, or intermediate feature maps produced by an object detection model) is fused before the result is predicted, which is more efficient and makes fuller use of the data. Because this information interaction happens at an early stage, each single branch already carries the important information of the other branch by the later stages, so the accuracy of a single branch matches or even exceeds that of the traditional two-stream method while using far fewer parameters. Compared with the non-local neural network, the attention module designed by the invention works across modalities rather than applying attention within a single modality only, and when the two modalities are identical its effect is equivalent to that of the non-local neural network.
The invention discloses a dual-stream video classification method based on a cross-modal attention mechanism, comprising the following steps:
1) establishing neural network structures for an RGB branch and an optical flow branch, each containing cross-modal attention modules;
2) obtaining RGB frames and optical flow from the video to be classified and feeding them into the RGB-branch and optical-flow-branch networks, respectively;
3) for the input RGB frames and optical flow, having the RGB-branch and optical-flow-branch networks exchange information through the cross-modal attention modules, thereby realizing cross-modal information fusion;
4) classifying the video according to the fused information produced by the RGB-branch and optical-flow-branch networks.
The cross-modal attention mechanism (cross-modal attention module) designed by the method mainly comprises three parts: the key (Key), the query (Query), and the value (Value). The keys index all of the information, the query indexes the information being sought, and the values are the information itself. The cross-modal attention mechanism can be described as: generate a query from the current modality, generate key-value pairs from the other modality, and acquire important information from the other modality according to how well the query matches the keys. The cross-modal attention mechanism is therefore a process of selectively acquiring information from another modality conditioned on the current modality; the acquired information is often weak or even missing in the current modality yet very important for the final result.
FIG. 1 shows an example implementation of the cross-modal attention mechanism. X and Y denote the inputs from the RGB branch and the optical flow branch, respectively. Q (query), K (key) and V (value) are generated from X or Y by 1x1 convolutions; their shapes are marked in the figure, and the tensors are transposed and reshaped as needed so that the matrix multiplications can be carried out. Multiplying Q and K gives M, the attention weight distribution of each pixel over the whole feature map. M is then multiplied with V, which selectively extracts information Z from V; Z undergoes a nonlinear transformation (for example, an activation function such as ReLU), and the transformed result is combined with the original input through a residual connection to obtain the final output (Output).
Through this operation, the RGB branch can, at the current network stage, look over all positions of the optical flow branch. The RGB branch can therefore survey all of the optical flow branch's information and selectively fuse the important parts with its current features, rather than merely taking a weighted sum at the final stage as the common two-stream method does. Moreover, because the operation's input and output shapes are identical and it can process inputs of any shape, it is highly compatible and can be inserted at almost any stage of almost any network, making full use of multi-scale information. To further improve compatibility, a residual connection is added: the result of the operation is added directly to the input that preceded it, which theoretically guarantees that a model with the cross-modal attention module added will not be less accurate than the original model.
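The following is a minimal PyTorch sketch of this operation; the module and variable names are illustrative rather than taken from the patent, and the softmax normalization of the attention weights is an assumption borrowed from the non-local network design, since the text only states that Q and K are multiplied to obtain the attention distribution M.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Cross-modal attention: query from the current modality, keys/values from the other."""

    def __init__(self, in_channels, inter_channels=None):
        super().__init__()
        inter_channels = inter_channels or in_channels // 2       # channel reduction
        self.query = nn.Conv2d(in_channels, inter_channels, 1)    # Q from the current modality (X)
        self.key = nn.Conv2d(in_channels, inter_channels, 1)      # K from the other modality (Y)
        self.value = nn.Conv2d(in_channels, inter_channels, 1)    # V from the other modality (Y)
        self.out = nn.Conv2d(inter_channels, in_channels, 1)      # restore the channel dimension

    def forward(self, x, y):
        # x: current-modality features (e.g. RGB), y: other-modality features (e.g. optical flow)
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)              # B x HW x C'
        k = self.key(y).flatten(2)                                # B x C' x HW
        v = self.value(y).flatten(2).transpose(1, 2)              # B x HW x C'
        m = F.softmax(torch.bmm(q, k), dim=-1)                    # M: attention over all positions of Y
        z = torch.bmm(m, v).transpose(1, 2).reshape(b, -1, h, w)  # Z: information selected from V
        return x + self.out(F.relu(z))                            # nonlinearity, then residual connection
```

Because the output has exactly the same shape as x, such a module can be dropped in after almost any convolutional stage, which is the compatibility property described above.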
FIG. 2 shows the network architecture designed by the present invention: a two-stream model with cross-modal attention modules incorporated. The two branches are an RGB branch and a Flow branch, responsible for processing the appearance features of the images and the dynamic information, respectively. The procedure is as follows:
Step 1: initialize network parameters. The network parameters are initialized with a model pre-trained on the ImageNet dataset and then trained on the Kinetics dataset until convergence.
Step 2: and (6) data processing. The input to the network requires both RGB and optical flow inputs, for RGB, frames are directly cut from the original video and then scaled to the specified resolution (224x 224); for optical Flow, RGB images of two adjacent frames are extracted through a TVL1 optical Flow algorithm of a GPU version in OpenCV, a plurality of continuous frames (such as five continuous frames) of optical Flow are stacked together to be used as input of a Flow branch, and the resolution of the Flow branch is consistent with that of RGB;
Step 3: after the data are obtained, the RGB frames and the optical flow are fed into the two branches, which exchange information through the cross-modal attention modules (i.e., the structure shown in FIG. 1, denoted CMA_1 ~ CMA_n in FIG. 2), thereby making full use of multi-modal information at multiple levels.
Step 4: the two branches eventually yield two results, which can also be weighted and summed as in the normal dual stream approach. Since the model can perform information fusion at an earlier stage, the accuracy of simultaneous prediction of two branches in a common dual-stream model can be achieved or even exceeded by performing video classification only by using the result of RGB branch, and at the moment, the optical flow branch does not need to perform subsequent operation (indicated by a dotted line in FIG. 2), a large number of parameters are saved, so that the model is very efficient. In addition, if the prediction precision is further improved by simultaneously adopting the two branch results of the model.
Based on the same inventive concept, the invention also provides a dual-stream video classification device based on a cross-modal attention mechanism, corresponding to the above method and comprising:
a network construction module, responsible for establishing the neural network structures of the RGB branch and the optical flow branch, including the cross-modal attention modules;
a data processing module, responsible for obtaining RGB frames and optical flow from the video to be classified and feeding them into the RGB-branch and optical-flow-branch networks, respectively;
an information fusion module, responsible for letting the RGB-branch and optical-flow-branch networks exchange information about the input RGB frames and optical flow through the cross-modal attention modules, realizing cross-modal information fusion;
a video classification module, responsible for classifying the video according to the fused information produced by the RGB-branch and optical-flow-branch networks.
Compared with the prior art, the invention has the following advantages:
(1) multi-modal information interaction is added, so multi-modal information can be fully exploited at multiple levels;
(2) the cross-modal attention mechanism selectively draws on the information of the two modalities, so complementary information is used efficiently and key objects are captured more accurately;
(3) the accuracy of the traditional two-stream method can be matched or even exceeded with far fewer parameters, and the results of the two branches can also be combined for prediction, further improving classification accuracy;
(4) the cross-modal attention module designed by the invention is highly compatible, does not conflict with most of the prior art, can be inserted into almost any existing network architecture, and consistently improves video classification accuracy.
Drawings
FIG. 1 is an exemplary diagram of a cross-modal attention module;
FIG. 2 is a diagram of the video classification network according to the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.
1. Configuration of cross-modal attention module
The cross-modal attention module can process input of any dimensionality and guarantees that the output shape matches the input shape, so it has excellent compatibility. For example, Q, K and V are obtained by 1x1 two-dimensional convolutions (for a 3-dimensional model, 1x1x1 three-dimensional convolutions), which reduce the channel dimension while producing Q, K and V, lowering the computational complexity and saving GPU memory. To simplify the computation further, the convolution can be preceded by a max-pooling operation that reduces the spatial dimension to 1/4. After Z is obtained, its channel dimension is restored to the input dimension by another convolution, followed by Batch Normalization (BN) whose parameters are initialized to zero, so that in its initial state the module has no effect on the result of the preceding network.
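A sketch of these configuration details, extending the earlier CrossModalAttention sketch (the layer grouping and function names are illustrative, not the patent's own code):

```python
import torch.nn as nn

def key_value_path(in_channels, inter_channels):
    # max-pooling with stride 2 keeps 1/4 of the spatial positions before the 1x1 convolution
    return nn.Sequential(nn.MaxPool2d(kernel_size=2, stride=2),
                         nn.Conv2d(in_channels, inter_channels, kernel_size=1))

def output_path(inter_channels, in_channels):
    # restore the channel dimension, then a zero-initialized BN so the module starts as a no-op
    block = nn.Sequential(nn.Conv2d(inter_channels, in_channels, kernel_size=1),
                          nn.BatchNorm2d(in_channels))
    nn.init.zeros_(block[1].weight)
    nn.init.zeros_(block[1].bias)
    return block
```

With the BN scale and shift zeroed, the residual branch contributes nothing at initialization, so inserting the module cannot degrade the pretrained backbone before training begins.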
2. Configuration of the network
Both branches are based on the ResNet-50 network. To exploit multi-scale and spatially precise information as much as possible while saving GPU memory, 5 (or another number of) cross-modal attention modules are inserted uniformly in the res3 and res4 stages, a configuration consistent with the non-local neural network. The RGB branch takes a single frame as input, while the optical flow branch takes five consecutive frames of optical flow. The weights of the RGB branch are initialized directly with parameters trained on ImageNet. The optical flow branch needs a small modification because its input shape differs from that of models trained on ImageNet: the convolution kernels of the first convolutional layer are averaged over the channel dimension, and the average is replicated five times to obtain kernels with five channels; the parameters of the other layers can be copied directly, so the ImageNet-trained parameters transfer well.
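A minimal sketch of this first-layer weight transfer; the function name is illustrative, and flow_in_channels stands for whatever channel count the chosen optical-flow stacking produces.

```python
import torch

def inflate_first_conv(rgb_weight: torch.Tensor, flow_in_channels: int) -> torch.Tensor:
    # rgb_weight: [out_channels, 3, k, k] kernels of the first conv layer pre-trained on ImageNet
    mean_kernel = rgb_weight.mean(dim=1, keepdim=True)        # average over the channel dimension
    return mean_kernel.repeat(1, flow_in_channels, 1, 1)      # replicate to match the flow input channels
```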
The network is built on the Temporal Segment Network (TSN) framework because it models long-range temporal relationships simply and efficiently. The whole video is divided evenly into m segments, each segment contributes one randomly selected frame as network input, yielding m results, and the final video prediction is based on the average of the m results.
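A sketch of this segment sampling and averaging (function names are illustrative):

```python
import random
import torch

def sample_segment_indices(num_frames, m):
    # one random frame index from each of m equal-length segments
    bounds = [round(i * num_frames / m) for i in range(m + 1)]
    return [random.randrange(bounds[i], max(bounds[i] + 1, bounds[i + 1])) for i in range(m)]

def video_prediction(model, frames, m):
    # frames: sequence of per-frame inputs; the video-level score is the mean of the m frame scores
    indices = sample_segment_indices(len(frames), m)
    scores = [model(frames[i].unsqueeze(0)) for i in indices]
    return torch.stack(scores, dim=0).mean(dim=0)
```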
3. Data processing
The resolutions of the original videos are not entirely uniform; they are uniformly scaled to 256x256. The optical flow is extracted with the GPU version of the TVL1 algorithm in OpenCV, truncated to [-20, 20], and then scaled to [-1, 1]. Data augmentation such as random cropping, scaling and mirroring is also applied; note that the two branches must receive identical augmentation for the same input: for example, if the top-left corner of the RGB image is cropped, the optical flow is cropped at the same top-left position. In the temporal dimension, the RGB image corresponds to the first of the five consecutive frames of optical flow.
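A sketch of the flow preprocessing under stated assumptions: the CPU TVL1 implementation from opencv-contrib is used here for illustration, whereas the text uses the GPU version; the [-20, 20] truncation and [-1, 1] scaling follow the text.

```python
import cv2
import numpy as np

tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

def flow_between(prev_bgr, next_bgr):
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = tvl1.calc(prev_gray, next_gray, None)        # H x W x 2 (x and y displacement)
    flow = np.clip(flow, -20.0, 20.0) / 20.0            # truncate to [-20, 20], then scale to [-1, 1]
    return flow.astype(np.float32)
```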
4. Network training
Because the optical flow branch generally converges more slowly than the RGB branch, the optical flow branch is trained first on the Kinetics dataset, which also helps it supply more accurate information to the RGB branch. Iterative training then begins, in which the RGB branch and the optical flow branch are optimized alternately. While the RGB branch is being trained, all parameters of the optical flow branch are frozen, including its cross-modal attention modules, and only the RGB-branch parameters are updated; the converse holds when the optical flow branch is trained. Each iteration trains for at most 30 epochs. In practice, the branch trained most recently tends to be more accurate than the other, so when weighting the two branch results the more accurate branch is given the higher weight (5:1), which yields higher accuracy. Typically, a single iteration already achieves very high accuracy.
Training uses the standard cross-entropy loss and stochastic gradient descent. The batch size is 128; BN parameters are updated during training, and synchronized BN is used to obtain more accurate BN statistics. The learning rate is initialized to 0.01 and reduced to one tenth of its current value whenever the training accuracy plateaus. To prevent overfitting, dropout of 0.7 and weight decay of 0.0005 are used, and K (the number of TSN segments) is set to 3 during training.
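A sketch of the alternating optimization under stated assumptions: the grouping of parameters by branch and the helper names are illustrative; the learning rate and weight decay follow the text, while other optimizer settings are left at their PyTorch defaults.

```python
import torch
import torch.nn as nn

def set_branch_trainable(branch, trainable):
    # freezing a branch also freezes the cross-modal attention modules it contains
    for p in branch.parameters():
        p.requires_grad = trainable

def train_one_side(model, active_branch, frozen_branch, loader, epochs=30):
    set_branch_trainable(frozen_branch, False)
    set_branch_trainable(active_branch, True)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD((p for p in active_branch.parameters() if p.requires_grad),
                                lr=0.01, weight_decay=0.0005)
    for _ in range(epochs):                       # at most 30 epochs per iteration
        for rgb, flow, label in loader:
            loss = criterion(model(rgb, flow), label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```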
At test time, the four corners and the center of the image are cropped and also mirrored, giving 10 samples; feeding these 10 samples into the network yields 10 results, which are averaged to produce the final video classification result. K in the TSN is set to 25.
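A sketch of the ten-crop test-time averaging; torchvision's TenCrop produces the four corner crops, the center crop and their horizontal flips, and the 224x224 crop size is assumed to match the training resolution.

```python
import torch
from torchvision import transforms

ten_crop = transforms.TenCrop(224)

def predict_ten_crop(model, image_tensor):
    # image_tensor: C x H x W; `model` stands for whichever branch (or fused model) is evaluated
    crops = torch.stack(ten_crop(image_tensor), dim=0)   # 10 x C x 224 x 224
    with torch.no_grad():
        scores = model(crops)                            # ten results, one per crop
    return scores.mean(dim=0)                            # averaged into the final classification result
```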
5. Transfer learning
The proposed network is trained on Kinetics, which has 400 classes; other datasets, such as UCF101, have only 101 classes, which partly overlap with and partly differ from the 400 Kinetics classes. To migrate the model to a new video classification task, only the last fully connected layer needs to be fine-tuned on the new dataset to achieve good results. The same approach applies to other datasets, so the model transfers well.
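A minimal sketch of this transfer step; it assumes the classification head is exposed as `model.fc` (as in torchvision's ResNet), and freezing the rest of the network is a simplifying choice made here, not a requirement stated in the text.

```python
import torch.nn as nn

def prepare_for_transfer(model, num_new_classes):
    for p in model.parameters():
        p.requires_grad = False                                   # keep the Kinetics-trained weights fixed
    model.fc = nn.Linear(model.fc.in_features, num_new_classes)  # new, trainable fully connected layer
    return model
```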
ResNet50 denotes the model without the cross-modal attention module, and CMA-ResNet50 denotes the model with the module added. -R denotes the RGB branch alone, and -S denotes the fusion of the RGB branch with the optical flow branch. Table 1 below reports experiments with ResNet50 as the backbone network:
TABLE 1 Experimental results with ResNet50 as the backbone network
Model Accuracy (%)
ResNet50-R 67.73
ResNet50-S 71.21
CMA-ResNet50-R 72.17
CMA-ResNet50-S 72.62
P3D is a three-dimensional convolutional neural network model, and CMA-P3D is the same model with three-dimensional cross-modal attention modules added. Table 2 below reports experiments with P3D as the backbone network:
TABLE 2 Experimental results with P3D as the backbone network
Model Accuracy (%)
P3D-R 71.50
P3D-S 74.62
CMA-P3D-R 74.86
CMA-P3D-S 75.98
The two tables above show that both the two-dimensional and the three-dimensional cross-modal attention modules consistently improve accuracy; with the module added, the RGB branch alone is already more accurate than the two-stream baselines in the comparison (ResNet50-S / P3D-S), and the two-stream model with the module added improves accuracy further still.
Another embodiment of the present invention provides a dual-stream video classification apparatus based on a cross-modal attention mechanism, comprising:
a network construction module, responsible for establishing the neural network structures of the RGB branch and the optical flow branch, including the cross-modal attention modules;
a data processing module, responsible for obtaining RGB frames and optical flow from the video to be classified and feeding them into the RGB-branch and optical-flow-branch networks, respectively;
an information fusion module, responsible for letting the RGB-branch and optical-flow-branch networks exchange information about the input RGB frames and optical flow through the cross-modal attention modules, realizing cross-modal information fusion;
a video classification module, responsible for classifying the video according to the fused information produced by the RGB-branch and optical-flow-branch networks.
The invention is not limited to the ResNet-50 network; it can be applied to various neural networks (such as VGG, DenseNet, SENet and the like) as well as to 3D neural networks (such as I3D, P3D and the like). Likewise, the operations inside the cross-modal attention module are not limited to the implementation described above: for example, the key/query/value can be generated with a more complex operation than a 1x1 convolution (such as a stack of several convolutional layers), and a more complex operation (again, several convolutional layers) may also be applied after Z is obtained. For merging with the main branch, the embodiment uses a residual connection, but other schemes, such as concatenation with the main-branch features, are also possible.
The above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention; a person skilled in the art may modify the technical solution or substitute equivalents without departing from its principle and scope, and the scope of protection of the present invention should be determined by the claims.

Claims (8)

1. A dual-stream video classification method based on a cross-modal attention mechanism, characterized by comprising the following steps:
1) establishing neural network structures for an RGB branch and an optical flow branch, each containing cross-modal attention modules;
2) obtaining RGB frames and optical flow from the video to be classified and feeding them into the RGB-branch and optical-flow-branch neural networks, respectively;
3) for the input RGB frames and optical flow, having the RGB-branch and optical-flow-branch networks exchange information through the cross-modal attention modules, thereby realizing cross-modal information fusion; the cross-modal attention module comprises keys, a query and values, generates the query from the current modality, generates key-value pairs from the other modality, and acquires important information from the other modality according to how well the query matches the keys;
4) classifying the video according to the fused information produced by the RGB-branch and optical-flow-branch neural networks;
wherein the neural network structures of the RGB branch and the optical flow branch use ResNet-50 as the base network, with a plurality of cross-modal attention modules inserted uniformly in the res3 and res4 stages; a temporal segment network framework is adopted to divide the whole video evenly into m segments, each segment randomly selects one frame as network input, yielding m results, and the final video prediction is based on the average of the m results.
2. The method of claim 1, wherein in the cross-modal attention module: X and Y represent the inputs from the RGB branch and the optical flow branch, respectively, and the query Q, the key K and the value V are generated from X or Y by 1x1 convolutions; multiplying Q and K yields M, which represents the attention weight distribution of each pixel over the whole feature map; multiplying M and V selectively acquires information Z from V; Z then undergoes a nonlinear transformation, and the transformed result is combined with the original input through a residual connection to obtain the final result.
3. The method of claim 2, wherein the cross-modal attention module reduces the channel dimension while obtaining Q, K and V through the convolution operations, lowering the computational complexity and saving GPU memory; the computation is further simplified by a max-pooling operation before the convolutions; after Z is obtained, its dimensionality is restored to the input dimensionality by a convolution operation followed by Batch Normalization (BN) whose parameters are initialized to zero.
4. The method of claim 1, wherein obtaining RGB and optical flow from the video to be classified comprises:
a) for RGB, cutting frames directly from the original video to be classified and scaling them to the specified resolution as the input of the RGB-branch neural network;
b) for optical flow, extracting the flow between two adjacent RGB frames with an optical flow algorithm and stacking several consecutive frames of optical flow together as the input of the optical-flow-branch neural network, with a resolution consistent with that of the RGB input.
5. The method of claim 1, wherein step 4) performs video classification by one of the following methods:
a) classifying the video using only the result of the RGB branch;
b) classifying the video using a weighted sum of the results of the two branches.
6. The method of claim 1, wherein the training process of the neural network structures of the RGB branch and the optical flow branch comprises: first training the optical flow branch, then starting iterative training, i.e., alternately optimizing the RGB branch and the optical flow branch; while the RGB branch is being trained, all parameters of the optical flow branch, including its cross-modal attention modules, are frozen and only the RGB-branch parameters are updated, and vice versa when the optical flow branch is trained; when weighting the two branch results, the more accurate branch is given the higher weight; and the training uses the standard cross-entropy loss function and stochastic gradient descent.
7. The method of claim 1, wherein the neural network structures of the RGB branch and the optical flow branch are migrated to a new dataset by fine-tuning the last fully connected layer, thereby realizing transfer learning.
8. A dual-stream video classification device based on a cross-modal attention mechanism, adopting the method of any one of claims 1 to 7 and comprising:
a network construction module, responsible for establishing the neural network structures of the RGB branch and the optical flow branch, including the cross-modal attention modules;
a data processing module, responsible for obtaining RGB frames and optical flow from the video to be classified and feeding them into the RGB-branch and optical-flow-branch networks, respectively;
an information fusion module, responsible for letting the RGB-branch and optical-flow-branch networks exchange information about the input RGB frames and optical flow through the cross-modal attention modules, realizing cross-modal information fusion;
a video classification module, responsible for classifying the video according to the fused information produced by the RGB-branch and optical-flow-branch networks.
CN201910294018.XA 2018-12-26 2019-04-12 Double-current video classification method and device based on cross-mode attention mechanism Active CN110188239B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811601971 2018-12-26
CN2018116019716 2018-12-26

Publications (2)

Publication Number Publication Date
CN110188239A CN110188239A (en) 2019-08-30
CN110188239B true CN110188239B (en) 2021-06-22

Family

ID=67714102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910294018.XA Active CN110188239B (en) 2018-12-26 2019-04-12 Double-current video classification method and device based on cross-mode attention mechanism

Country Status (1)

Country Link
CN (1) CN110188239B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852273B (en) * 2019-11-12 2023-05-16 重庆大学 Behavior recognition method based on reinforcement learning attention mechanism
CN111160452A (en) * 2019-12-25 2020-05-15 北京中科研究院 Multi-modal network rumor detection method based on pre-training language model
CN111104553B (en) * 2020-01-07 2023-12-12 中国科学院自动化研究所 Efficient motor complementary neural network system
CN111325155B (en) * 2020-02-21 2022-09-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
CN111476131B (en) * 2020-03-30 2021-06-11 北京微播易科技股份有限公司 Video processing method and device
CN111723239B (en) * 2020-05-11 2023-06-16 华中科技大学 Video annotation method based on multiple modes
CN111709304B (en) * 2020-05-21 2023-05-05 江南大学 Behavior recognition method based on space-time attention-enhancing feature fusion network
CN111709306B (en) * 2020-05-22 2023-06-09 江南大学 Double-flow network behavior identification method based on multilevel space-time feature fusion enhancement
CN111428699B (en) * 2020-06-10 2020-09-22 南京理工大学 Driving fatigue detection method and system combining pseudo-3D convolutional neural network and attention mechanism
CN111931713B (en) * 2020-09-21 2021-01-29 成都睿沿科技有限公司 Abnormal behavior detection method and device, electronic equipment and storage medium
US20240005628A1 (en) * 2020-11-19 2024-01-04 Intel Corporation Bidirectional compact deep fusion networks for multimodality visual analysis applications
CN112489092B (en) * 2020-12-09 2023-10-31 浙江中控技术股份有限公司 Fine-grained industrial motion modality classification method, storage medium, device and apparatus
CN112650886B (en) * 2020-12-28 2022-08-02 电子科技大学 Cross-modal video time retrieval method based on cross-modal dynamic convolution network
CN112949433B (en) * 2021-02-18 2022-07-22 北京百度网讯科技有限公司 Method, device and equipment for generating video classification model and storage medium
CN113657425B (en) * 2021-06-28 2023-07-04 华南师范大学 Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN115393779B (en) * 2022-10-31 2023-03-24 济宁九德半导体科技有限公司 Control system and control method for laser cladding metal ball manufacturing
CN116776157B (en) * 2023-08-17 2023-12-12 鹏城实验室 Model learning method supporting modal increase and device thereof
CN117422704B (en) * 2023-11-23 2024-08-13 南华大学附属第一医院 Cancer prediction method, system and equipment based on multi-mode data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681695A (en) * 2018-04-26 2018-10-19 北京市商汤科技开发有限公司 Video actions recognition methods and device, electronic equipment and storage medium
CN109034001A (en) * 2018-07-04 2018-12-18 安徽大学 Cross-modal video saliency detection method based on space-time clues

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107273800B (en) * 2017-05-17 2020-08-14 大连理工大学 Attention mechanism-based motion recognition method for convolutional recurrent neural network
CN107609460B (en) * 2017-05-24 2021-02-02 南京邮电大学 Human body behavior recognition method integrating space-time dual network flow and attention mechanism
CN108388900B (en) * 2018-02-05 2021-06-08 华南理工大学 Video description method based on combination of multi-feature fusion and space-time attention mechanism

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108681695A (en) * 2018-04-26 2018-10-19 北京市商汤科技开发有限公司 Video actions recognition methods and device, electronic equipment and storage medium
CN109034001A (en) * 2018-07-04 2018-12-18 安徽大学 Cross-modal video saliency detection method based on space-time clues

Also Published As

Publication number Publication date
CN110188239A (en) 2019-08-30

Similar Documents

Publication Publication Date Title
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
CN108520535B (en) Object classification method based on depth recovery information
Qassim et al. Compressed residual-VGG16 CNN model for big data places image recognition
EP3732619B1 (en) Convolutional neural network-based image processing method and image processing apparatus
Lee et al. From big to small: Multi-scale local planar guidance for monocular depth estimation
CN109584337B (en) Image generation method for generating countermeasure network based on condition capsule
Ricci et al. Monocular depth estimation using multi-scale continuous CRFs as sequential deep networks
Xu et al. Learning deep structured multi-scale features using attention-gated crfs for contour prediction
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
CN112446476A (en) Neural network model compression method, device, storage medium and chip
Rahmon et al. Motion U-Net: Multi-cue encoder-decoder network for motion segmentation
CN111797683A (en) Video expression recognition method based on depth residual error attention network
CN108960059A (en) A kind of video actions recognition methods and device
CN110378208B (en) Behavior identification method based on deep residual error network
CN110222718B (en) Image processing method and device
Hara et al. Towards good practice for action recognition with spatiotemporal 3d convolutions
Jia et al. Stacked denoising tensor auto-encoder for action recognition with spatiotemporal corruptions
Yang et al. A robust iris segmentation using fully convolutional network with dilated convolutions
CN114219824A (en) Visible light-infrared target tracking method and system based on deep network
CN110782503B (en) Face image synthesis method and device based on two-branch depth correlation network
Chacon-Murguia et al. Moving object detection in video sequences based on a two-frame temporal information CNN
Li et al. SAT-Net: Self-attention and temporal fusion for facial action unit detection
CN115527275A (en) Behavior identification method based on P2CS _3DNet
Wang Micro-expression Recognition Based on Multi-Scale Attention Fusion
CN114202801A (en) Gesture recognition method based on attention-guided airspace map convolution simple cycle unit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant