Domenico Cotroneo, Alessio Foggia, Cristina Improta, Pietro Liguori, Roberto Natella
{"title":"Automating the correctness assessment of AI-generated code for security contexts","authors":"Domenico Cotroneo, Alessio Foggia, Cristina Improta, Pietro Liguori, Roberto Natella","doi":"10.1016/j.jss.2024.112113","DOIUrl":null,"url":null,"abstract":"<div><p>Evaluating the correctness of code generated by AI is a challenging open problem. In this paper, we propose a fully automated method, named <em>ACCA</em>, to evaluate the correctness of AI-generated code for security purposes. The method uses symbolic execution to assess whether the AI-generated code behaves as a reference implementation. We use <em>ACCA</em> to assess four state-of-the-art models trained to generate security-oriented assembly code and compare the results of the evaluation with different baseline solutions, including output similarity metrics, widely used in the field, and the well-known ChatGPT, the AI-powered language model developed by OpenAI.</p><p>Our experiments show that our method outperforms the baseline solutions and assesses the correctness of the AI-generated code similar to the human-based evaluation, which is considered the ground truth for the assessment in the field. Moreover, <em>ACCA</em> has a very strong correlation with the human evaluation (Pearson’s correlation coefficient <span><math><mrow><mi>r</mi><mo>=</mo><mn>0</mn><mo>.</mo><mn>84</mn></mrow></math></span> on average). Finally, since it is a full y automated solution that does not require any human intervention, the proposed method performs the assessment of every code snippet in <span><math><mrow><mo>∼</mo><mn>0</mn><mo>.</mo><mn>17</mn></mrow></math></span> s on average, which is definitely lower than the average time required by human analysts to manually inspect the code, based on our experience.</p></div>","PeriodicalId":51099,"journal":{"name":"Journal of Systems and Software","volume":null,"pages":null},"PeriodicalIF":3.7000,"publicationDate":"2024-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0164121224001584/pdfft?md5=c7734d2003ab22f80edc9da32fb97026&pid=1-s2.0-S0164121224001584-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems and Software","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0164121224001584","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0
Abstract
Evaluating the correctness of code generated by AI is a challenging open problem. In this paper, we propose a fully automated method, named ACCA, to evaluate the correctness of AI-generated code for security purposes. The method uses symbolic execution to assess whether the AI-generated code behaves as a reference implementation. We use ACCA to assess four state-of-the-art models trained to generate security-oriented assembly code, and we compare the results of the evaluation against several baseline solutions, including output similarity metrics widely used in the field and ChatGPT, the well-known AI-powered language model developed by OpenAI.
Our experiments show that our method outperforms the baseline solutions and assesses the correctness of the AI-generated code similarly to the human-based evaluation, which is considered the ground truth for the assessment in the field. Moreover, ACCA has a very strong correlation with the human evaluation (Pearson's correlation coefficient r = 0.84 on average). Finally, since it is a fully automated solution that requires no human intervention, the proposed method assesses every code snippet in ~0.17 s on average, far less than the average time required by human analysts to manually inspect the code, based on our experience.
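To make the core idea concrete, below is a minimal Python sketch of how symbolic execution can check whether a generated snippet behaves like a reference implementation. This is an illustration only, not the authors' ACCA implementation: it assumes the angr symbolic execution framework, the binaries candidate.bin and reference.bin and the entry address 0x401000 are hypothetical, and it reduces each snippet's behavior to a single 64-bit return value.

# Minimal sketch (illustration only, NOT the authors' ACCA tool) of
# behavioral equivalence checking via symbolic execution with angr.
# Assumptions: "candidate.bin" / "reference.bin" and the 0x401000 entry
# address are hypothetical; behavior is reduced to one 64-bit return
# value, ignoring memory side effects and per-path constraints.
import angr
import claripy

def symbolic_result(binary_path, func_addr, sym_input):
    # Load the binary and symbolically call the routine at func_addr,
    # returning the symbolic expression left in the return register.
    proj = angr.Project(binary_path, auto_load_libs=False)
    routine = proj.factory.callable(func_addr)
    return routine(sym_input)

# Feed the *same* symbolic input to the candidate and the reference.
x = claripy.BVS("x", 64)
cand_out = symbolic_result("candidate.bin", 0x401000, x)
ref_out = symbolic_result("reference.bin", 0x401000, x)

# The two snippets agree iff no input can make their outputs differ,
# so ask the solver for a counterexample.
solver = claripy.Solver()
if solver.satisfiable(extra_constraints=[cand_out != ref_out]):
    witness = solver.eval(x, 1, extra_constraints=[cand_out != ref_out])
    print(f"Not equivalent: outputs diverge for input {witness[0]:#x}")
else:
    print("Equivalent: same return value for every input")

The single-return-value comparison is a deliberate simplification to keep the sketch short; a full assessment would also have to account for memory effects and for the path constraints accumulated along each execution.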
Journal description:
The Journal of Systems and Software publishes papers covering all aspects of software engineering and related hardware-software-systems issues. All articles should include a validation of the idea presented, e.g. through case studies, experiments, or systematic comparisons with other approaches already in practice. Topics of interest include, but are not limited to:
• Methods and tools for, and empirical studies on, software requirements, design, architecture, verification and validation, maintenance and evolution
• Agile, model-driven, service-oriented, open source and global software development
• Approaches for mobile, multiprocessing, real-time, distributed, cloud-based, dependable and virtualized systems
• Human factors and management concerns of software development
• Data management and big data issues of software systems
• Metrics and evaluation, data mining of software development resources
• Business and economic aspects of software development processes
The journal welcomes state-of-the-art surveys and reports of practical experience for all of these topics.