DOI: 10.1145/3611643.3616243
Research Article | Public Access

Baldur: Whole-Proof Generation and Repair with Large Language Models

Published: 30 November 2023

Abstract

Formally verifying software is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: we use large language models, trained on natural language and code and fine-tuned on proofs, to generate whole proofs at once. We then demonstrate that a model fine-tuned to repair generated proofs further increases proving power. This paper: (1) demonstrates that whole-proof generation using transformers is possible and is as effective as, but more efficient than, search-based techniques; (2) demonstrates that giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair that further improves automated proof generation; (3) establishes, together with prior work, a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs, empirically showing the effectiveness of whole-proof generation, repair, and added context. We also show that Baldur complements the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.

Supplementary Material

Video (fse23main-p30-p-video.mp4)




Published In

ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2023, 2215 pages
ISBN: 9798400703270
DOI: 10.1145/3611643


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. proof assistants
  2. automated formal verification
  3. large language models
  4. machine learning
  5. proof repair
  6. proof synthesis



Conference

ESEC/FSE '23

Acceptance Rates

Overall acceptance rate: 112 of 543 submissions, 21%


Article Metrics

  • Downloads (last 12 months): 776
  • Downloads (last 6 weeks): 84
Reflects downloads up to 28 Dec 2024


Cited By

  • (2024) When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs. Transactions of the Association for Computational Linguistics, 12, 1417–1440. https://doi.org/10.1162/tacl_a_00713 (4 Nov 2024)
  • (2024) Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies. Transactions of the Association for Computational Linguistics, 12, 484–506. https://doi.org/10.1162/tacl_a_00660 (3 May 2024)
  • (2024) Large Language Models for Software Engineering: A Systematic Literature Review. ACM Transactions on Software Engineering and Methodology, 33(8), 1–79. https://doi.org/10.1145/3695988 (20 Sep 2024)
  • (2024) If At First You Don't Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 63–78. https://doi.org/10.1145/3694715.3695971 (4 Nov 2024)
  • (2024) Proof Automation with Large Language Models. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 1509–1520. https://doi.org/10.1145/3691620.3695521 (27 Oct 2024)
  • (2024) CoqPilot, a Plugin for LLM-based Generation of Proofs. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2382–2385. https://doi.org/10.1145/3691620.3695357 (27 Oct 2024)
  • (2024) History-Driven Fuzzing for Deep Learning Libraries. ACM Transactions on Software Engineering and Methodology, 34(1), 1–29. https://doi.org/10.1145/3688838 (16 Aug 2024)
  • (2024) Generative AI for Self-Adaptive Systems: State of the Art and Research Roadmap. ACM Transactions on Autonomous and Adaptive Systems, 19(3), 1–60. https://doi.org/10.1145/3686803 (30 Sep 2024)
  • (2024) LLM-Enhanced Theorem Proving with Term Explanation and Tactic Parameter Repair. In Proceedings of the 15th Asia-Pacific Symposium on Internetware, 21–30. https://doi.org/10.1145/3671016.3674823 (24 Jul 2024)
  • (2024) Automatic Programming vs. Artificial Intelligence. In Proceedings of the 1st ACM International Conference on AI-Powered Software, 144–146. https://doi.org/10.1145/3664646.3664775 (10 Jul 2024)
