Abstract
Visual (image, video) quality assessment can be modelled by visual features in different domains, e.g., the spatial, frequency, and temporal domains. Perceptual mechanisms in the human visual system (HVS) play a crucial role in the generation of quality perception. This paper proposes a general framework for no-reference visual quality assessment using efficient windowed transformer architectures. A lightweight module for multi-stage channel attention is integrated into the Swin (shifted window) Transformer. Such a module can represent appropriate perceptual mechanisms in image quality assessment (IQA) to build an accurate IQA model. Meanwhile, representative features for image quality perception in the spatial and frequency domains can also be derived from the IQA model; these are then fed into another windowed transformer architecture for video quality assessment (VQA). The VQA model efficiently reuses attention information across local windows to tackle the expensive time and memory complexities of the original transformer. Experimental results on large-scale IQA and VQA databases demonstrate that the proposed quality assessment models outperform other state-of-the-art models by large margins.
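As an illustration only (not the paper's actual implementation, which is not reproduced on this page), the "lightweight channel attention" idea mentioned in the abstract can be sketched as a squeeze-and-excitation-style gate applied to the channel dimension of a transformer stage's feature maps. The function names, weight shapes, and reduction ratio below are assumptions for the sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(features, w1, w2):
    """Hypothetical lightweight channel-attention gate.

    features: (C, H, W) feature map from one transformer stage.
    w1: (C//r, C) bottleneck reduction weights.
    w2: (C, C//r) bottleneck expansion weights.
    Returns the feature map rescaled per channel by a learned gate.
    """
    # Squeeze: global average pooling over spatial dims -> (C,)
    z = features.mean(axis=(1, 2))
    # Excitation: bottleneck MLP (ReLU, then sigmoid gating in [0, 1])
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))
    # Scale: reweight each channel of the feature map
    return features * s[:, None, None]

# Toy usage with random weights (reduction ratio r = 2)
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = channel_attention(x, w1, w2)
print(y.shape)  # (8, 4, 4)
```

In a multi-stage design such as the one the abstract describes, a gate like this would be inserted after each hierarchical stage, so that channel importance is re-estimated as spatial resolution decreases and channel count grows.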
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
You, J., Zhang, Z. (2023). Visual Mechanisms Inspired Efficient Transformers for Image and Video Quality Assessment. In: Arai, K. (eds) Advances in Information and Communication. FICC 2023. Lecture Notes in Networks and Systems, vol 652. Springer, Cham. https://doi.org/10.1007/978-3-031-28073-3_33
DOI: https://doi.org/10.1007/978-3-031-28073-3_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28072-6
Online ISBN: 978-3-031-28073-3
eBook Packages: Intelligent Technologies and Robotics (R0)