A GPU optimization workflow for real-time execution of ultra-high frame rate computer vision applications

Mohsen Nourazar¹, Brian G. Booth¹ & Bart Goossens¹


Abstract

This work proposes a GPU optimization methodology for real-time execution of ultra-high frame rate applications with small frame sizes. While the use of GPUs for offline processing is well established, real-time execution remains challenging due to the lack of real-time execution guarantees, especially on embedded GPUs. Our methodology introduces guidelines and a workflow that focus on: (a) controlling latency by minimizing CPU-GPU interactions; (b) computation pruning; and (c) inter/intra-kernel optimizations. Furthermore, our approach takes advantage of multi-frame processing to attain significantly higher throughput at the cost of increased latency when the application permits such a trade-off. To evaluate our optimization methodology, we applied it to the monitoring and control of machines for laser powder bed fusion, a widely used metal additive manufacturing technique. Results show that, in the considered application, the required performance could be obtained on a Jetson Xavier AGX platform and that, by sacrificing latency, significantly higher throughput was achieved.
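The trade-off between throughput and latency in multi-frame processing can be illustrated with a toy cost model. The sketch below is not the authors' implementation. It assumes a fixed CPU-GPU overhead paid once per batch, a constant GPU compute time per frame, and a constant camera frame period. The class name, its fields, and the numbers in the usage loop are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchModel:
    """Toy cost model for multi-frame processing (all values are assumptions)."""

    launch_overhead_us: float    # fixed CPU-GPU interaction cost, paid once per batch
    per_frame_compute_us: float  # GPU compute time spent on each frame
    frame_period_us: float       # time between consecutive frames from the camera

    def batch_time_us(self, n_frames: int) -> float:
        # One overhead per batch plus the compute time of every frame in it.
        return self.launch_overhead_us + n_frames * self.per_frame_compute_us

    def throughput_fps(self, n_frames: int) -> float:
        # Frames processed per second while the GPU is busy with batches.
        return n_frames / self.batch_time_us(n_frames) * 1e6

    def worst_case_latency_us(self, n_frames: int) -> float:
        # The first frame of a batch waits for the remaining frames to arrive,
        # then for the whole batch to be processed.
        wait_us = (n_frames - 1) * self.frame_period_us
        return wait_us + self.batch_time_us(n_frames)


# Hypothetical numbers, chosen only to show the shape of the trade-off.
model = BatchModel(launch_overhead_us=20.0,
                   per_frame_compute_us=5.0,
                   frame_period_us=50.0)

for n in (1, 2, 4, 8):
    print(n, round(model.throughput_fps(n)), model.worst_case_latency_us(n))
```

In this model the per-batch overhead is amortized over more frames as the batch grows, so throughput rises, while the first frame of each batch waits longer for its result.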

Data availability

The data used in this study is not available for public sharing as it was obtained under license.

Notes

In our work, a frame is considered "small" if it fits within the shared memory of a streaming multiprocessor (SM) and the size of the work (e.g., the number of pixels or elements to be processed) falls within the range of the thread-block size.
A warp is a group of threads (typically 32) that execute the same instruction simultaneously on a single SM.
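The two notes above can be turned into a quick feasibility check. This is only a sketch of one possible reading. It assumes one thread per pixel, interprets "within the range of the thread-block size" as "at most the maximum number of threads per block" (1024 on current CUDA devices), and takes the shared-memory capacity as a parameter because it depends on the device. The function names and the example capacity are hypothetical.

```python
import math

WARP_SIZE = 32  # threads per warp ("typically 32" in the note above)


def is_small_frame(width: int,
                   height: int,
                   bytes_per_pixel: int,
                   shared_mem_bytes: int,
                   max_threads_per_block: int = 1024) -> bool:
    """Return True if the frame is "small" under the assumptions stated above."""
    pixels = width * height
    # Condition 1: the whole frame fits in the available shared memory.
    fits_in_shared_memory = pixels * bytes_per_pixel <= shared_mem_bytes
    # Condition 2: one thread per pixel does not exceed a single thread block.
    fits_in_one_block = pixels <= max_threads_per_block
    return fits_in_shared_memory and fits_in_one_block


def warps_needed(n_threads: int) -> int:
    """Number of warps required to run n_threads threads (rounded up)."""
    return math.ceil(n_threads / WARP_SIZE)


# Hypothetical capacity; query the real value for the target device.
SHARED_MEM_BYTES = 48 * 1024

print(is_small_frame(32, 32, 2, SHARED_MEM_BYTES))
print(is_small_frame(64, 64, 2, SHARED_MEM_BYTES))
print(warps_needed(32 * 32))
```

A frame that fails the second condition can still be processed, for example by assigning several pixels to each thread, but it then falls outside the definition of "small" as read here.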

Acknowledgements

This work is financially supported by the VLAIO ICON project 'Vision in the Loop' (HBC.2019.2808), a collaboration between imec, Flanders Make, Materialise, Dekimo, ESMA and AdditiveLab. The RTX A6000 GPU used for this research was donated by the NVIDIA Corporation.

Author information

Authors and Affiliations

Department of Telecommunications and Information Processing, imec-IPI-Ghent University, 9000, Ghent, Belgium
Mohsen Nourazar, Brian G. Booth & Bart Goossens

Contributions

MN and BG designed and developed the optimization methodology. MN carried out the experiments. MN wrote the manuscript with support from BG and BGB. BG and BGB supervised the project. All authors reviewed the manuscript.

Corresponding author

Correspondence to Mohsen Nourazar.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nourazar, M., Booth, B.G. & Goossens, B. A GPU optimization workflow for real-time execution of ultra-high frame rate computer vision applications. J Real-Time Image Proc 21, 5 (2024). https://doi.org/10.1007/s11554-023-01384-7

Received: 11 August 2023
Accepted: 29 October 2023
Published: 26 November 2023
DOI: https://doi.org/10.1007/s11554-023-01384-7
