Abstract
Visual (image, video) quality assessment can be modelled by visual features in different domains, e.g., the spatial, frequency, and temporal domains. Perceptual mechanisms in the human visual system (HVS) play a crucial role in the generation of quality perception. This paper proposes a general framework for no-reference visual quality assessment using efficient windowed transformer architectures. A lightweight module for multi-stage channel attention is integrated into the Swin (shifted window) Transformer. Such a module can represent appropriate perceptual mechanisms in image quality assessment (IQA) to build an accurate IQA model. Meanwhile, representative features for image quality perception in the spatial and frequency domains can also be derived from the IQA model; these are then fed into another windowed transformer architecture for video quality assessment (VQA). The VQA model efficiently reuses attention information across local windows to tackle the expensive time and memory complexities of the original transformer. Experimental results on large-scale IQA and VQA databases demonstrate that the proposed quality assessment models outperform other state-of-the-art models by large margins.
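As an illustration only (not the paper's actual implementation, which is not reproduced on this page), the "lightweight channel attention" idea mentioned in the abstract can be sketched as a squeeze-and-excitation-style gate applied to the channel dimension of a transformer stage's feature maps. The function names, weight shapes, and reduction ratio below are assumptions for the sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(features, w1, w2):
    """Hypothetical lightweight channel-attention gate.

    features: (C, H, W) feature map from one transformer stage.
    w1: (C//r, C) bottleneck reduction weights.
    w2: (C, C//r) bottleneck expansion weights.
    Returns the feature map rescaled per channel by a learned gate.
    """
    # Squeeze: global average pooling over spatial dims -> (C,)
    z = features.mean(axis=(1, 2))
    # Excitation: bottleneck MLP (ReLU, then sigmoid gating in [0, 1])
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))
    # Scale: reweight each channel of the feature map
    return features * s[:, None, None]

# Toy usage with random weights (reduction ratio r = 2)
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = channel_attention(x, w1, w2)
print(y.shape)  # (8, 4, 4)
```

In a multi-stage design such as the one the abstract describes, a gate like this would be inserted after each hierarchical stage, so that channel importance is re-estimated as spatial resolution decreases and channel count grows.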
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
You, J., Zhang, Z. (2023). Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment. In: Arai, K. (eds) Advances in Information and Communication. FICC 2023. Lecture Notes in Networks and Systems, vol 652. Springer, Cham. https://doi.org/10.1007/978-3-031-28073-3_33
DOI: https://doi.org/10.1007/978-3-031-28073-3_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28072-6
Online ISBN: 978-3-031-28073-3
eBook Packages: Intelligent Technologies and Robotics (R0)