NO.217 Trusted Automatic Programming

Shonan Village Center

January 20 - 23, 2025 (Check-in: January 19, 2025)


  • Corina Pasareanu
    • NASA Ames, USA and Carnegie Mellon University, USA
  • Abhik Roychoudhury
    • National University of Singapore, Singapore
  • Adish Singla
    • Max Planck Institute, Germany


Description of the meeting:

The task of programming, both in terms of intent capture and the generation of correct code, has occupied much of the computing profession for the last 50-60 years. There has been significant progress in modeling and system design to support accurate intent capture, leading to the growth of formal specifications. However, despite all this progress, software engineers are reluctant to write formal specifications, and for large software systems a formal description of intent is typically not available, leading to tremendous hardship in debugging and fixing errors.

Subsequently, development has moved to testing and analysis methods that help build trustworthy codebases despite the lack of a formal capture of intent. These techniques typically aim for higher behavioral coverage, as in testing, and often work with simple test oracles, such as the crash-freedom oracles used in widely popular fuzzing techniques. Since an oracle denotes the expected behavior of a test case, assuming trivial oracles obviates the need to specify the programmer's intent formally. Fuzzing methods have witnessed tremendous popularity in the past decade, with a huge number of vulnerabilities found in widely used software systems via fuzzing. Nevertheless, the quest to achieve functional correctness in software, without having to author voluminous formal specifications, remains.
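The crash-freedom oracle mentioned above can be illustrated with a minimal random fuzzer. This is a hedged sketch: `parse_record` is a hypothetical buggy parser invented for illustration, and the fuzzer is coverage-unaware, unlike production tools such as AFL or libFuzzer. The only property checked is that the target does not raise, so no formal specification of intent is needed.

```python
import random

def parse_record(data: bytes):
    # Hypothetical buggy target: trusts the declared payload length and
    # crashes (IndexError) when the input is shorter than it claims.
    if not data:
        return None
    n = data[0] % 8      # declared payload length
    return data[1 + n]   # IndexError when len(data) < n + 2

def fuzz(target, trials=2000, seed=0):
    # Random fuzzing with a crash-freedom oracle: the only "spec"
    # checked is that the target does not raise an exception.
    rng = random.Random(seed)
    crashes = []
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(8)))
        try:
            target(data)
        except Exception:
            crashes.append(data)  # crashing input found, no intent spec needed
    return crashes

crashing_inputs = fuzz(parse_record)
```

Each element of `crashing_inputs` is a concrete input that reproduces a failure, which is exactly the kind of evidence fuzzing campaigns report against real software.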

Recent developments in automated code generation with large language models (LLMs) provide a fresh perspective in this regard. Since LLM-based code generation allows programming from natural-language specifications, it seems to promise the long-sought goal of auto-coding. This raises not only the overall question of the correctness of automatically generated code, but also the question of at what point we can start trusting such code. In past decades, niche industries have generated code from models; however, there is no precedent for code automatically generated from natural-language specifications being used widely.

[Questions to be studied] In the proposed Shonan Meeting, we will coordinate an exchange among researchers in machine learning / machine teaching, formal methods, and software engineering, with the aim of answering the following questions.

  • What evidence is acceptable for code from LLMs to be integrated into a software project?

  • How do the trust boundaries shift when we integrate automatically generated code instead of manually written code? Are the acceptability criteria likely to be more stringent for automatically generated code?

  • Can automatic program repair methods be used to improve the code generated from LLMs? Can repair techniques generate evidence so that the improved code can be accepted into codebases?

  • An emerging pattern is counterexample-guided synthesis, similar to counterexample-guided inductive synthesis (CEGIS), in which the LLM is used as a black-box synthesis engine whose output can be verified. Can this approach be used to obtain the required guarantees for the code produced by LLMs?

  • Improvements in the precision of LLMs can complement the goal of trustworthiness of automatically generated code. Can these dual criteria lead to a tunable set of parameters in the future, e.g., code from smaller LLMs would need to pass a more stringent set of checks and improvements?

  • Another approach could be to constrain the output of LLMs, since fine-tuning LLMs may not be effective in practice; this would be similar to current neuro-symbolic approaches, but applied to LLMs. What techniques can effectively constrain LLM outputs so as to obtain the desired guarantees?

  • How can advances in LLMs impact programming education and the teaching of programming in the future? Can they lead to intelligent tutoring systems that interact with both students and LLMs?

  • Last but not least, what is the appropriate user-study design that will allow us to examine the level of trust developers place in code from LLMs?
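The counterexample-guided synthesis pattern raised in the questions above can be sketched in a few lines. This is a hedged toy model, not a real system: `propose_candidate` is a stand-in for prompting an LLM with the specification plus the counterexamples found so far (here it just walks a fixed candidate list), and `verify` performs bounded verification by exhaustive checking against an oracle, where a real verifier might use symbolic execution or an SMT solver. The specification is "compute the absolute value of x".

```python
# Candidate implementations a black-box synthesizer might propose.
CANDIDATES = [
    lambda x: x,                    # wrong for negative inputs
    lambda x: -x,                   # wrong for positive inputs
    lambda x: x if x >= 0 else -x,  # correct: absolute value
]

def propose_candidate(counterexamples, attempt):
    # Stand-in for an LLM call conditioned on the spec and the
    # counterexamples accumulated so far.
    return CANDIDATES[attempt]

def verify(f):
    # Bounded verification: exhaustively check a finite input domain
    # against the oracle abs(); returns a counterexample or None.
    for x in range(-100, 101):
        if f(x) != abs(x):
            return x
    return None

def cegis():
    counterexamples = []
    for attempt in range(len(CANDIDATES)):
        candidate = propose_candidate(counterexamples, attempt)
        cex = verify(candidate)
        if cex is None:
            return candidate  # passed bounded verification
        counterexamples.append(cex)  # feed the failure back to the synthesizer
    return None  # synthesis failed within the attempt budget

solution = cegis()
```

The loop either returns a candidate that passed verification, which is precisely the kind of machine-checkable evidence the questions above ask for, or fails explicitly with the counterexamples it gathered.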