Abstract
This work proposes a GPU optimization methodology for real-time execution of ultra-high frame rate applications with small frame sizes. While the use of GPUs for offline processing is well established, real-time execution remains challenging because GPUs, especially embedded ones, lack real-time execution guarantees. Our methodology provides guidelines and a workflow that focus on: (a) controlling latency by minimizing CPU-GPU interactions; (b) computation pruning; and (c) inter- and intra-kernel optimizations. Furthermore, when the application permits the trade-off, our approach uses multi-frame processing to attain significantly higher throughput at the cost of increased latency. To evaluate the methodology, we applied it to the monitoring and control of laser powder bed fusion machines, a widely used metal additive manufacturing technique. Results show that, for this application, the required performance could be obtained on a Jetson Xavier AGX platform, and that sacrificing latency yielded significantly higher throughput.
Data availability
The data used in this study is not available for public sharing as it was obtained under license.
Notes
In our work, a frame is considered "small" if it fits within the shared memory of a streaming multiprocessor (SM) and the size of the work (e.g., the number of pixels or elements to be processed) is within the range of a thread-block size.
A warp is a group of threads (typically 32) that execute the same instruction simultaneously on a single SM.
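As an illustration only, the "small frame" criterion in the first note can be read as a two-part check. The sketch below assumes one element per thread, and the names `GpuLimits` and `is_small_frame` and the default limits (48 KiB of shared memory per block, 1024 threads per block) are hypothetical placeholders that do not come from the article; the real limits should be queried from the target device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLimits:
    """Hypothetical device limits (assumed values, not taken from the article)."""
    shared_mem_per_block_bytes: int = 48 * 1024  # assumed shared-memory budget
    max_threads_per_block: int = 1024            # assumed thread-block size limit


def is_small_frame(width: int, height: int, bytes_per_pixel: int,
                   limits: GpuLimits = GpuLimits()) -> bool:
    """Return True if a frame satisfies both 'small frame' conditions.

    Assumption: one pixel (element) is mapped to one thread, so the work
    size is compared directly against the thread-block size limit.
    """
    num_pixels = width * height
    frame_bytes = num_pixels * bytes_per_pixel
    fits_shared_memory = frame_bytes <= limits.shared_mem_per_block_bytes
    fits_thread_block = num_pixels <= limits.max_threads_per_block
    return fits_shared_memory and fits_thread_block
```

Under these assumed limits, a 32 × 32 single-byte frame passes both checks, whereas a 64 × 64 frame exceeds the assumed thread-block size even though it still fits in shared memory.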
Acknowledgements
This work is financially supported by the VLAIO ICON project ‘Vision in the Loop’ (HBC.2019.2808), a collaboration between imec, Flanders Make, Materialise, Dekimo, ESMA and AdditiveLab. The RTX A6000 GPU used for this research was donated by the NVIDIA Corporation.
Author information
Contributions
MN and BG designed and developed the optimization methodology. MN carried out the experiments. MN wrote the manuscript with support from BG and BGB. BG and BGB supervised the project. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
About this article
Cite this article
Nourazar, M., Booth, B.G. & Goossens, B. A GPU optimization workflow for real-time execution of ultra-high frame rate computer vision applications. J Real-Time Image Proc 21, 5 (2024). https://doi.org/10.1007/s11554-023-01384-7
DOI: https://doi.org/10.1007/s11554-023-01384-7