DOI: 10.1145/3581784.3607052
research-article

TrivialSpy: Identifying Software Triviality via Fine-grained and Dataflow-based Value Profiling

Published: 11 November 2023

Abstract

Trivial operations cause software inefficiencies: they waste functional units and memory bandwidth on executing useless instructions. Although previous works have identified a significant number of trivial operations in widely used programs, the proposed solutions provide only useful observations rather than actionable guidance for eliminating trivial operations to improve performance. In this paper, we propose TrivialSpy, a fine-grained and dataflow-based value profiler that effectively identifies software triviality and estimates its optimization potential. With the help of dataflow analysis, TrivialSpy can detect three kinds of software triviality: heavy operations, trivial chains, and redundant backward slices. In addition, TrivialSpy can identify trivial breakpoints that combine multiple trivial conditions, exposing more optimization opportunities. The evaluation results demonstrate that TrivialSpy is capable of identifying software triviality in highly optimized programs. Based on the optimization guidance provided by TrivialSpy, we achieve a performance speedup of up to 52.09% after eliminating trivial operations.
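To make the notion of a trivial operation concrete, the following is a minimal sketch of value-based triviality counting over a recorded operation trace. It is not TrivialSpy's implementation and does not model its dataflow analysis. The function names (`is_trivial`, `profile`, `trivial_fraction`), the trace format, and the rule set are all assumptions introduced here for illustration. The rules are written for integer operands; for floating-point values, `x * 0` is not always `0` (NaN and infinity), so a real profiler would need extra checks.

```python
from collections import Counter

def is_trivial(op, a, b):
    """Return a label if one operand alone determines the result, else None.

    Hypothetical rule set for integer operands (illustration only).
    """
    if op == "mul":
        if a == 0 or b == 0:
            return "absorbing"   # x * 0 == 0
        if a == 1 or b == 1:
            return "identity"    # x * 1 == x
    elif op == "add":
        if a == 0 or b == 0:
            return "identity"    # x + 0 == x
    elif op == "div":
        if a == 0 and b != 0:
            return "absorbing"   # 0 / x == 0
        if b == 1:
            return "identity"    # x / 1 == x
    return None

def profile(trace):
    """Count (opcode, label) pairs over a trace of (op, a, b) tuples."""
    counts = Counter()
    for op, a, b in trace:
        counts[(op, is_trivial(op, a, b) or "nontrivial")] += 1
    return counts

def trivial_fraction(counts):
    """Fraction of profiled operations that matched a trivial condition."""
    total = sum(counts.values())
    trivial = sum(n for (_, label), n in counts.items() if label != "nontrivial")
    return trivial / total if total else 0.0

# A made-up trace of operand values observed at run time.
trace = [("mul", 3, 0), ("mul", 5, 1), ("add", 2, 7), ("div", 8, 1), ("add", 0, 4)]
counts = profile(trace)
fraction = trivial_fraction(counts)
```

In this toy trace, four of the five operations match a trivial condition. A count like this only says how often trivial values occur; deciding which of them are worth eliminating is the part the abstract attributes to dataflow analysis and optimization potential estimation.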

Supplemental Material

MP4 File: SC23 paper presentation recording for "TrivialSpy: Identifying Software Triviality via Fine-grained and Dataflow-based Value Profiling", by Xin You, Hailong Yang, Kelun Lei, Zhongzhi Luan and Depei Qian

Information

Published In

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2023
1428 pages
ISBN:9798400701092
DOI:10.1145/3581784

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dynamic binary instrumentation
  2. software triviality
  3. performance analysis
  4. performance optimization

Qualifiers

  • Research-article

Conference

SC '23

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Total Citations: 0
  • Total Downloads: 316
  • Downloads (Last 12 months): 254
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 19 Dec 2024
