Research Article

Video Coding for Machines: Compact Visual Representation Compression for Intelligent Collaborative Analytics

Published: 20 February 2024

Abstract

As an emerging research practice leveraging recent advanced AI techniques, e.g., deep-model-based prediction and generation, Video Coding for Machines (VCM) is committed to bridging the largely separate research tracks of video/image compression and feature compression, and attempts to jointly optimize compactness and efficiency from a unified perspective covering high-accuracy machine vision and full-fidelity human vision. With the rapid advances of deep feature representation and visual data compression in mind, this paper summarizes the methodology and philosophy of VCM based on existing academic and industrial efforts. The development of VCM follows a general rate-distortion optimization, and key modules and techniques are categorized, including feature-assisted coding, scalable coding, intermediate feature compression/optimization, and machine-vision-targeted codecs, from the broader perspectives of vision tasks, analytics resources, etc. Although existing works attempt to reveal the nature of scalable representation in bits when handling machine and human vision tasks, the generality of low-bit-rate representations, and accordingly how they can support a variety of visual analytics tasks, remains rarely studied. Therefore, we investigate a novel visual information compression for the analytics taxonomy problem to strengthen the capability of compact visual representations extracted from multiple tasks for visual analytics. A new perspective on the relationship between tasks and compression is presented. Keeping in mind the transferability among different machine vision tasks (e.g., high-level semantic and mid-level geometry-related tasks), we aim to support multiple tasks jointly at low bit rates. In particular, to narrow the dimensionality gap between neural-network features extracted from pixels and the variety of machine vision features/labels (e.g., scene class, segmentation labels), a codebook hyperprior is designed to compress the neural-network-generated features. As demonstrated in our experiments, this new hyperprior model improves feature compression efficiency by estimating the signal entropy more accurately, which enables further investigation of the granularity of abstracting compact features across different tasks.
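
Two of the abstract's claims can be made concrete. First, the "general rate-distortion optimization" that VCM follows can be sketched as balancing bit rate against both human-vision and machine-vision distortion; this is a hedged reading, and the weights \lambda_h, \lambda_m and the task network T(\cdot) are illustrative notation rather than the paper's own symbols:

    \min_{\theta} \; R(\hat{y}) + \lambda_h \, D_h(x, \hat{x}) + \lambda_m \, D_m\big(T(x), T(\hat{x})\big)

where x is the input image, \hat{x} its reconstruction for human viewing, \hat{y} the coded representation with rate R, D_h a signal-fidelity metric (e.g., MSE), and D_m a task-level distortion on analysis outputs.

Second, the codebook hyperprior admits a compact sketch: a side representation of the feature is quantized to its nearest codeword, and the selected codeword conditions a Gaussian entropy model for the feature itself, so only a codeword index is spent as side information. The PyTorch sketch below is a minimal illustration under these assumptions; the module names, dimensions, and straight-through trick are illustrative, not taken from the authors' released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CodebookHyperprior(nn.Module):
        # Hypothetical sizes: 256-d features, 512 codewords of dimension 64.
        def __init__(self, feat_dim=256, num_codes=512, code_dim=64):
            super().__init__()
            self.codebook = nn.Parameter(torch.randn(num_codes, code_dim))
            self.hyper_enc = nn.Linear(feat_dim, code_dim)      # feature -> side info
            self.hyper_dec = nn.Linear(code_dim, 2 * feat_dim)  # codeword -> (mean, scale)

        def forward(self, y):                     # y: (B, feat_dim) features to compress
            z = self.hyper_enc(y)                 # continuous side representation
            dist = torch.cdist(z, self.codebook)  # (B, num_codes) distances to codewords
            idx = dist.argmin(dim=1)              # side info costs <= log2(num_codes) bits
            code = z + (self.codebook[idx] - z).detach()  # straight-through quantization
            mean, scale = self.hyper_dec(code).chunk(2, dim=1)
            scale = F.softplus(scale) + 1e-6      # keep the Gaussian scale positive
            # Rate proxy: Gaussian negative log-likelihood of y in bits (constant dropped).
            nats = (0.5 * ((y - mean) / scale) ** 2 + torch.log(scale)).sum(dim=1)
            return nats / torch.log(torch.tensor(2.0)), idx

During training, this rate term would simply be added to the task losses in the objective above; a more accurate conditional entropy model lowers the bits needed for the same downstream accuracy, which is the efficiency gain the abstract refers to.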

Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 46, Issue 7, July 2024, 658 pages

Publisher

IEEE Computer Society, United States
