[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article
Open access

IR2VEC: LLVM IR Based Scalable Program Embeddings

Published: 22 December 2020 Publication History

Abstract

We propose IR2VEC, a Concise and Scalable encoding infrastructure to represent programs as a distributed embedding in continuous space. This distributed embedding is obtained by combining representation learning methods with flow information to capture the syntax as well as the semantics of the input programs. As our infrastructure is based on the Intermediate Representation (IR) of the source code, obtained embeddings are both language and machine independent. The entities of the IR are modeled as relationships, and their representations are learned to form a seed embedding vocabulary. Using this infrastructure, we propose two incremental encodings: Symbolic and Flow-Aware. Symbolic encodings are obtained from the seed embedding vocabulary, and Flow-Aware encodings are obtained by augmenting the Symbolic encodings with the flow information.
We show the effectiveness of our methodology on two optimization tasks (Heterogeneous device mapping and Thread coarsening). Our way of representing the programs enables us to use non-sequential models resulting in orders of magnitude of faster training time. Both the encodings generated by IR2VEC outperform the existing methods in both the tasks, even while using simple machine learning models. In particular, our results improve or match the state-of-the-art speedup in 11/14 benchmark-suites in the device mapping task across two platforms and 53/68 benchmarks in the thread coarsening task across four different platforms. When compared to the other methods, our embeddings are more scalable, is non-data-hungry, and has better Out-Of-Vocabulary (OOV) characteristics.

References

[1]
Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2014. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014). ACM, 281--293.
[2]
Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2015. Suggesting accurate method and class names. In Proc. of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). ACM, 38--49.
[3]
Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51, 4 (2018), 81.
[4]
Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to represent programs with graphs. In Proceedings of the International Conference on Learning Representations. https://openreview.net/forum?id=BJOFETxR-
[5]
Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2006. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA.
[6]
Miltiadis Allamanis, Hao Peng, and Charles A. Sutton. 2016. A convolutional attention network for extreme summarization of source code. In Proceedings of the 33nd International Conference on Machine Learning, (ICML 2016). 2091--2100. http://proceedings.mlr.press/v48/allamanis16.html.
[7]
Miltiadis Allamanis, Daniel Tarlow, Andrew Gordon, and Yi Wei. 2015. Bimodal modelling of source code and natural language. In Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 37. PMLR, Lille, France, 2123--2132. http://proceedings.mlr.press/v37/allamanis15.html.
[8]
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2018. A general path-based representation for predicting program properties. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2018). ACM, New York, 404--419.
[9]
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. Code2Vec: Learning distributed representations of code. In Proceedings of the ACM Conference on Programming Languages (POPL), Vol. 3, Article 40 (Jan. 2019), 29 pages.
[10]
Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 8 (Aug. 2013), 1798--1828.
[11]
Alexander Brauckmann, Andrés Goens, Sebastian Ertel, and Jeronimo Castrillon. 2020. Compiler-based graph representations for deep learning models of code. In Proceedings of the 29th International Conference on Compiler Construction (Feb. 2020), 201--211.
[12]
James Bucek, Klaus-Dieter Lange, and Jóakim v. Kistowski. 2018. SPEC CPU2017: Next-generation compute benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering (ICPE’18). ACM, New York, 41--42.
[13]
Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler. 2018. Neural code comprehension: A learnable representation of code semantics. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS’18). Curran Associates Inc., 3589--3601. http://dl.acm.org/citation.cfm?id=3327144.3327276.
[14]
Sushil Bajracharya, Joel Ossher, and Cristina Lopes. 2014. Sourcerer: An infrastructure for large-scale collection and analysis of open-source code. Sci. Comput. Program. 79 (Jan. 2014), 241--259.
[15]
Boost. 2018. Boost C++ Libraries. https://www.boost.org/. Accessed 2019-05-16.
[16]
Antoine Bordes, Nicolas Usunier, Alberto Garcia-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS’13). Curran Associates Inc., 2787--2795. http://dl.acm.org/citation.cfm?id=2999792.2999923.
[17]
Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, and Hugh Leather. 2020. ProGraML: Graph-based deep learning for program optimization and analysis. arXiv preprint arXiv:2003.10536 (2020).
[18]
Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. 1991. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems (TOPLAS) 13, 4 (1991), 451--490.
[19]
Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. 2019. When deep learning met code search. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019). Association for Computing Machinery, New York, 964--974.
[20]
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms, Third Edition (3rd ed.). The MIT Press.
[21]
Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. 2017. End-to-end deep learning of optimization heuristics. In Proceedings of the 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 219--232.
[22]
Keith D. Cooper, Devika Subramanian, and Linda Torczon. 2002. Adaptive optimizing compilers for the 21st century. J. Supercomput. 23, 1 (Aug. 2002), 7--22.
[23]
Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon, Zbigniew Chamski, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Bilha Mendelson, Ayal Zaks, Eric Courtois, Francois Bodin, Phil Barnard, Elton Ashton, Edwin Bonilla, John Thomson, Christopher K. I. Williams, and Michael O’Boyle. 2011. Milepost GCC: Machine learning enabled self-tuning compiler. International Journal of Parallel Programming 39, 3 (01 Jun 2011), 296--327.
[24]
GeeksforGeeks. 2003. C/C++/Java programs. https://www.geeksforgeeks.org. Accessed 2019-08-15.
[25]
Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: Fixing common C language errors by deep learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI’17). AAAI Press, 1345--1351. http://dl.acm.org/citation.cfm?id=3298239.3298436
[26]
Dominik Grewe, Zheng Wang, and Michael F. P. O’Boyle. 2013. Portable mapping of data parallel programs to OpenCL for heterogeneous systems. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization, (CGO 2013), (Shenzhen, China, February 23-27), 2013. IEEE Computer Society, 22:1--22:10.
[27]
Ameer Haj-Ali, Nesreen K. Ahmed, Ted Willke, Yakun Sophia Shao, Krste Asanovic, and Ion Stoica. 2020. NeuroVectorizer: End-to-end vectorization with deep reinforcement learning. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO 2020). Association for Computing Machinery, New York, NY, USA, 242--255.
[28]
Xu Han, Shulin Cao, Xin Lv, Yankai Lin, Zhiyuan Liu, Maosong Sun, and Juanzi Li. 2018. OpenKE: An open toolkit for knowledge embedding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Brussels, Belgium, 139--144.
[29]
Matthew S. Hecht. 1977. Flow Analysis of Computer Programs. Elsevier Science Inc., New York, NY, USA.
[30]
Jordan Henkel, Shuvendu K. Lahiri, Ben Liblit, and Thomas Reps. 2018. Code vectors: Understanding programs through embedded abstracted symbolic traces. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2018). ACM, NY, USA, 163--174.
[31]
S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2073--2083.
[32]
G. Ji, S. He, L. Xu, K. Liu, and J. Zhao. 2015. Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, 687--696.
[33]
P. J. Joseph, Matthew T. Jacob, Y. N. Srikant, and Kapil Vaswani. 2007. Statistical and machine learning techniques in compiler design. In The Compiler Design Handbook: Optimizations and Machine Code Generation, Second Edition, Y. N. Srikant and Priti Shankar (Eds.). CRC Press.
[34]
V. Kashyap, D. B. Brown, B. Liblit, D. Melski, and T. Reps. 2017. Source forager: A search engine for similar source code. arXiv preprint arXiv:1706.02769 (2017).
[35]
Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis 8 transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization. IEEE Computer Society, 75.
[36]
Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI’15). AAAI Press, 2181--2187.
[37]
LLVM. 2018. LLVM Language Reference. https://llvm.org/docs/LangRef.html. Accessed 2019-08-20.
[38]
Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, and Satish Chandra. 2019. Aroma: Code recommendation via structural code search. Proc. ACM Program. Lang. 3, (OOPSLA), Article 152 (Oct. 2019), 28 pages.
[39]
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[40]
Alberto Magni, Christophe Dubach, and Michael F. P. O’Boyle. 2013. A large-scale cross-architecture evaluation of thread-coarsening. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’13, Denver, CO, USA - November 17-21, 2013). William Gropp and Satoshi Matsuoka (Eds.). ACM, 11:1--11:11.
[41]
Alberto Magni, Christophe Dubach, and Michael O’Boyle. 2014. Automatic optimization of thread-coarsening for graphics processors. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation. ACM, 455--466.
[42]
L. Mou, G. Li, L Zhang, T. Wang, and Z. Jin. 2016. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the 13th AAAI Conference on Artificial Intelligence (AAAI’16). AAAI Press, 1287--1293.
[43]
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS’13). Curran Associates Inc., 3111--3119. http://dl.acm.org/citation.cfm?id=2999792.2999959
[44]
Steven S. Muchnick. 1997. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA.
[45]
C. Mendis, C. Yang, Y. Pu, S. Amarasinghe, and M. Carbin. 2019. Compiler auto-vectorization with imitation learning. In Advances in Neural Information Processing Systems 32 (NeurIPS). Curran Associates, Inc., 14598--14609.
[46]
Michael Pradel and Koushik Sen. 2018. DeepBugs: A learning approach to name-based bug detection. In Proceedings of the ACM Symposium on Programming Languages, Volume 2. (OOPSLA), Article 147 (Oct. 2018), 25 pages.
[47]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532--1543.
[48]
H. G. Rice. 1953. Classes of recursively enumerable sets and their decision problems. Trans. Amer. Math. Soc. 74, 2 (1953), 358--366. http://www.jstor.org/stable/1990888.
[49]
R. Rolim, G. Soares, L. D’Antoni, O. Polozov, S. Gulwani, R. Gheyi, R. Suzuki, and B. Hartmann. 2017. Learning syntactic program transformations from examples. In Proceedings of the 39th International Conference on Software Engineering (ICSE’17). IEEE Press, Piscataway, N. J., 404--415.
[50]
Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. arXiv preprint arXiv:1704.07535 (2017).
[51]
V. Raychev, M. Vechev, and A. Krause. 2015. Predicting program properties from “big code”. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL’15). ACM, New York, 111--124.
[52]
N. Rosenblum, X. Zhu, and B. P. Miller. 2011. Who wrote this code? Identifying the authors of program binaries. In Proceedings of the 16th European Conference on Research in Computer Security (ESORICS’11). Springer-Verlag, Berlin, 172--189. http://dl.acm.org/citation.cfm?id=2041225.2041239
[53]
M. Stephenson and S. Amarasinghe. 2005. Predicting unroll factors using supervised classification. In Proceedings of the International Symposium on Code Generation and Optimization. 123--134.
[54]
Douglas Simon, John Cavazos, Christian Wimmer, and Sameer Kulkarni. 2013. Automatic construction of inlining heuristics using machine learning In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO’13). IEEE Computer Society, 1--12.
[55]
Nicolai Stawinoga and Tony Field. 2018. Predictable thread coarsening. ACM Trans. Arch. Code Optim. 15, 2, Article 23 (June 2018), 26 pages.
[56]
I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (NIPS’14). MIT Press, Cambridge, MA, 3104--3112. http://dl.acm.org/citation.cfm?id=2969033.2969173.
[57]
Vasily Volkov and James W. Demmel. 2008. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC’08). IEEE Press, Article 31, 11 pages.
[58]
Martin Vechev and Eran Yahav. 2016. Programming with “big code ”. Found. Trends Program. Lang. 3, 4 (Dec. 2016), 231--284.
[59]
S. Wang, D. Chollak, D. Movshovitz-Attias, and L. Tan. 2016. Bugram: Bug detection with N-gram language models. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016). ACM, New York, 708--719.
[60]
Svante Wold, Kim Esbensen, and Paul Geladi. 1987. Principal component analysis. Chemometrics and Intelligent Laboratory Systems 2, 1--3 (1987), 37--52.
[61]
Ke Wang, Rishabh Singh, and Zhendong Su. 2018. Dynamic neural program embeddings for program repair. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), (Vancouver, Canada, Apr 30 - May 3, 2018). https://openreview.net/forum?id=BJuWrGW0Z.
[62]
Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI’14). AAAI Press, 1112--1119.
[63]
Y. Xiong, J. Wang, R. Yan, J. Zhang, S. Han, G. Huang, and L. Zhang. 2017. Precise condition synthesis for program repair. In Proceedings of the 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). 416--426.

Cited By

View all
  • (2025)ARS-Flow 2.0: An enhanced design space exploration flow for accelerator-rich system based on active learningIntegration10.1016/j.vlsi.2024.102315101(102315)Online publication date: Mar-2025
  • (2025)PragFormer: Data-Driven Parallel Source Code Classification with TransformersInternational Journal of Parallel Programming10.1007/s10766-024-00778-953:1Online publication date: 1-Feb-2025
  • (2024)HPC-Coder: Modeling Parallel Programs using Large Language ModelsISC High Performance 2024 Research Paper Proceedings (39th International Conference)10.23919/ISC.2024.10528929(1-12)Online publication date: May-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization
ACM Transactions on Architecture and Code Optimization  Volume 17, Issue 4
December 2020
430 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3427420
Issue’s Table of Contents
© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 December 2020
Accepted: 01 August 2020
Revised: 01 July 2020
Received: 01 February 2020
Published in TACO Volume 17, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. LLVM
  2. compiler optimizations
  3. heterogeneous systems
  4. intermediate representations
  5. representation learning

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Visvesvaraya Young Faculty Research Fellowship from MeitY
  • NSM
  • Visvesvaraya PhD Scheme under the MEITY, GoI
  • AMD
  • Department of Electronics 8 Information Technology and the Ministry of Communications 8 Information Technology, Government of India

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1,041
  • Downloads (Last 6 weeks)109
Reflects downloads up to 07 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2025)ARS-Flow 2.0: An enhanced design space exploration flow for accelerator-rich system based on active learningIntegration10.1016/j.vlsi.2024.102315101(102315)Online publication date: Mar-2025
  • (2025)PragFormer: Data-Driven Parallel Source Code Classification with TransformersInternational Journal of Parallel Programming10.1007/s10766-024-00778-953:1Online publication date: 1-Feb-2025
  • (2024)HPC-Coder: Modeling Parallel Programs using Large Language ModelsISC High Performance 2024 Research Paper Proceedings (39th International Conference)10.23919/ISC.2024.10528929(1-12)Online publication date: May-2024
  • (2024)AdaCCDProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v38i16.29749(17942-17950)Online publication date: 20-Feb-2024
  • (2024)MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance OptimizationsProceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques10.1145/3656019.3676895(156-167)Online publication date: 14-Oct-2024
  • (2024)G2PM: Performance Modeling for ACAP Architecture with Dual-Tiered Graph Representation LearningProceedings of the 61st ACM/IEEE Design Automation Conference10.1145/3649329.3655898(1-6)Online publication date: 23-Jun-2024
  • (2024)Exponentially Expanding the Phase-Ordering Search Space via Dormant InformationProceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction10.1145/3640537.3641582(250-261)Online publication date: 17-Feb-2024
  • (2024)The Next 700 ML-Enabled Compiler OptimizationsProceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction10.1145/3640537.3641580(238-249)Online publication date: 17-Feb-2024
  • (2024)Prism: Decomposing Program Semantics for Code Clone Detection through CompilationProceedings of the IEEE/ACM 46th International Conference on Software Engineering10.1145/3597503.3639129(1-13)Online publication date: 20-May-2024
  • (2024)Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability DetectionProceedings of the IEEE/ACM 46th International Conference on Software Engineering10.1145/3597503.3623345(1-13)Online publication date: 20-May-2024
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media