Elastic temporal alignment for few‐shot action recognition

Published: 04 August 2022

Abstract

Few-shot action recognition aims to learn a classification model that generalises well when trained with only a few labelled videos. Learning discriminative feature representations for videos in this setting, however, is difficult. To this end, Elastic Temporal Alignment (ETA) for few-shot action recognition is proposed. First, a convolutional neural network extracts feature representations of frames sparsely sampled from each video. To obtain the similarity of two videos, a temporal alignment estimation function estimates the matching score between each pair of frames from the two videos through an elastic alignment mechanism. Analysis shows that deciding whether two frames from the respective videos match should take multiple adjacent frames into account, so that temporal information is embodied in the decision. Therefore, before the per-frame feature vectors are fed into the temporal alignment estimation function, a temporal message passing function propagates per-frame feature information along the temporal domain. The method has been evaluated on four action recognition datasets: Kinetics, Something-Something V2, HMDB51, and UCF101. The experimental results verify the effectiveness of ETA and show its superiority over state-of-the-art methods.
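The abstract describes two components: a temporal message passing function that lets each frame's feature absorb context from its temporal neighbours, and a temporal alignment estimation function that scores frame pairs under an elastic alignment. The paper's exact formulation is not reproduced on this page, so the following is a minimal PyTorch sketch under stated assumptions: a bidirectional GRU stands in for the message passing step, and a soft-DTW-style dynamic program (a differentiable relaxation of dynamic time warping) stands in for the elastic alignment mechanism. The module names, the cosine-similarity matching score, and the gamma smoothing parameter are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a BiGRU is assumed for temporal message passing
# and a soft-DTW-style recursion for the elastic alignment; neither is
# claimed to be the paper's exact ETA formulation.
import torch
import torch.nn.functional as F
from torch import nn


class TemporalMessagePassing(nn.Module):
    """Propagate per-frame feature information along the temporal domain
    (assumed bidirectional GRU)."""

    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, D) per-frame CNN features for one video
        out, _ = self.rnn(frames.unsqueeze(0))  # (1, T, D) after the BiGRU
        return out.squeeze(0)                   # context-aware frame features


def elastic_alignment_score(a: torch.Tensor, b: torch.Tensor,
                            gamma: float = 0.1) -> torch.Tensor:
    """Soft elastic alignment similarity between two frame sequences.

    a: (Ta, D), b: (Tb, D) message-passed frame features.
    Returns a scalar; higher means the two videos align better.
    """
    # Per-frame matching scores via cosine similarity, turned into costs.
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T  # (Ta, Tb)
    cost = 1.0 - sim
    Ta, Tb = cost.shape
    # Accumulated-cost table with an infinite boundary.
    R = torch.full((Ta + 1, Tb + 1), float("inf"))
    R[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            # Soft minimum over the three elastic moves:
            # match (diagonal), stretch a (down), stretch b (right).
            prev = torch.stack([R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]])
            R[i, j] = cost[i - 1, j - 1] - gamma * torch.logsumexp(-prev / gamma, dim=0)
    return -R[Ta, Tb]  # negate the alignment cost so larger = more similar


# Usage sketch: classify a query video by its alignment score to each
# support video's features (both (T, 512) here, an assumed feature size).
# mp = TemporalMessagePassing(dim=512)
# score = elastic_alignment_score(mp(query_feats), mp(support_feats))
```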



Published In

IET Computer Vision, Volume 17, Issue 1
February 2023, 121 pages
EISSN: 1751-9640
DOI: 10.1049/cvi2.v17.1
This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.

Publisher

John Wiley & Sons, Inc.

United States

Qualifiers

  • Research-article
