DOI: 10.1145/3453483.3454083
Research Article · Public Access

DNNFusion: accelerating deep neural networks execution with advanced operator fusion

Published: 18 June 2021

Abstract

Deep Neural Networks (DNNs) have emerged as the core enabler of many major applications on mobile devices. To achieve high accuracy, DNN models have become increasingly deep, with hundreds or even thousands of operator layers, leading to high memory and computational requirements for inference. Operator fusion (or kernel/layer fusion) is a key optimization in many state-of-the-art DNN execution frameworks, such as TensorFlow, TVM, and MNN, that aim to improve the efficiency of DNN inference. However, these frameworks usually adopt fusion approaches based on certain patterns that are too restrictive to cover the diversity of operators and layer connections, especially those seen in many extremely deep models. Polyhedral-based loop fusion techniques, on the other hand, work on a low-level view of the computation without operator-level information, and can also miss potential fusion opportunities. To address this challenge, this paper proposes a novel and extensive loop fusion framework called DNNFusion. The basic idea is to work at an operator-level view of DNNs while expanding fusion opportunities by developing a classification of both individual operators and their combinations. In addition, DNNFusion includes 1) a novel mathematical-property-based graph rewriting framework to reduce evaluation costs and facilitate subsequent operator fusion, 2) integrated fusion plan generation that leverages this high-level analysis together with accurate, lightweight profiling, and 3) additional optimizations during fusion code generation. DNNFusion is extensively evaluated on 15 DNN models spanning varied task types, model sizes, and layer counts. The evaluation results demonstrate that DNNFusion finds up to 8.8× more fusion opportunities and outperforms four state-of-the-art DNN execution frameworks with up to 9.3× speedup. The reductions in memory requirements and the speedups enable the execution of many of the target models on mobile devices and can even make them part of real-time applications.
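
To make the fusion idea in the abstract concrete, here is a minimal sketch in NumPy. It contrasts an unfused operator-by-operator pipeline with a fused kernel and notes one mathematical-property-based rewrite of the kind the abstract mentions. This is an illustration only, not DNNFusion's implementation or API; all function and variable names in it are hypothetical.

    import numpy as np

    def scale_shift_relu_unfused(x, scale, shift):
        # Three separate operators: each materializes an intermediate
        # tensor, so the data makes three round trips through memory.
        t1 = x * scale            # Mul
        t2 = t1 + shift           # Add
        return np.maximum(t2, 0)  # ReLU

    def scale_shift_relu_fused(x, scale, shift):
        # One fused kernel: the same arithmetic in a single pass with no
        # intermediate tensors -- the memory and latency benefit of fusion.
        return np.maximum(x * scale + shift, 0)

    # Mathematical-property-based rewriting (illustrative): by distributivity,
    # (x + y) * s == x * s + y * s; picking the cheaper form can reduce
    # evaluation cost or expose further fusion opportunities to a later pass.
    x = np.random.rand(1024).astype(np.float32)
    scale, shift = np.float32(0.5), np.float32(-0.1)
    assert np.allclose(scale_shift_relu_unfused(x, scale, shift),
                       scale_shift_relu_fused(x, scale, shift))

In a system like the one the paper describes, such rewriting and fusion decisions are made on the operator graph before code generation rather than on concrete array expressions, but the performance intuition is the same.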

Published In

PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation
June 2021
1341 pages
ISBN: 9781450383912
DOI: 10.1145/3453483
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2021

Author Tags

  1. Compiler Optimization
  2. Deep Neural Network
  3. Mobile Devices
  4. Operator Fusion

Qualifiers

  • Research-article

Conference

PLDI '21

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%
