No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT

IEEE Transactions on Software Engineering (IF 6.5; Region 1, Computer Science; JCR Q1, Computer Science, Software Engineering). Pub Date: 2024-04-23. DOI: 10.1109/TSE.2024.3392499

Zhijie Liu, Yutian Tang, Xiapu Luo, Yuming Zhou, Liang Feng Zhang

Citations: 0

Abstract

Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks, such as machine translation, question answering, and summarization. LLMs are also highly valuable in supporting software engineering tasks, particularly code generation. Automatic code generation is the process of automatically generating source code or executable code from given specifications or requirements, improving developer productivity. In this study, we perform a systematic empirical assessment of the quality of code generation using ChatGPT, a recent state-of-the-art product LLM. We leverage 728 algorithm problems in five languages (i.e., C, C++, Java, Python, and JavaScript) and 18 CWEs with 54 code scenarios for the code generation task. Our evaluation encompasses a comprehensive analysis of code snippets generated by ChatGPT, focusing on three critical aspects: correctness, complexity, and security. We also specifically investigate ChatGPT's ability to engage in a multi-round fixing process (i.e., ChatGPT's dialog ability: chatting between users and ChatGPT to fix generated buggy code) to facilitate code generation. By delving into the generated code and examining the experimental results, this work provides valuable insights into the performance of ChatGPT in tackling code generation tasks over the three critical aspects.

The experimental results demonstrate that (1) ChatGPT is better at generating functionally correct code for problems before 2021 than for problems after 2021 across the different languages, with a 48.14% advantage in Accepted rate on the judgment platform, but ChatGPT's ability to directly fix erroneous code through the multi-round fixing process to achieve correct functionality is relatively weak; (2) the distribution of cyclomatic and cognitive complexity levels for code snippets varies across languages, and the multi-round fixing process with ChatGPT generally preserves or increases the complexity levels of code snippets; (3) in algorithm scenarios in C, C++, and Java, and in CWE scenarios in C and Python3, the code generated by ChatGPT has relevant vulnerabilities; however, the multi-round fixing process for vulnerable code snippets demonstrates promising results, with more than 89% of vulnerabilities successfully addressed; and (4) code generation may be affected by ChatGPT's non-determinism, resulting in variations of code snippets in functional correctness, complexity, and security.

Overall, our findings uncover potential issues and limitations that arise in ChatGPT-based code generation and lay the groundwork for improving AI and LLM-based code generation techniques.
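To make the complexity finding concrete, the sketch below approximates cyclomatic complexity for a Python snippet as one plus the number of decision points. It is only an illustration: the abstract does not name the tool or counting rules used in the study, so the function `cyclomatic_complexity`, the set of counted node types, and both sample snippets are assumptions introduced here, not material from the paper.

```python
import ast

# Node types counted as decision points. This is a simplified rule set
# chosen for illustration; real tools differ in what they count.
_DECISION_NODES = (
    ast.If, ast.For, ast.While, ast.AsyncFor,
    ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" contributes two extra branches.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical first attempt at a linear search.
SNIPPET_BEFORE = '''
def find(nums, target):
    for i, n in enumerate(nums):
        if n == target:
            return i
    return -1
'''

# Hypothetical revision after a fixing round that adds an input guard.
SNIPPET_AFTER = '''
def find(nums, target):
    if nums is None or len(nums) == 0:
        return -1
    for i, n in enumerate(nums):
        if n == target:
            return i
    return -1
'''

if __name__ == "__main__":
    print(cyclomatic_complexity(SNIPPET_BEFORE))  # 3
    print(cyclomatic_complexity(SNIPPET_AFTER))   # 5
```

Under this counting rule, the added guard raises the score from 3 to 5, which is the kind of change a fixing round can introduce even when the revised code behaves correctly.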

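Source journal

IEEE Transactions on Software Engineering (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 9.70

Self-citation rate: 10.80%

Publication volume: 724

Review time: 6 months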

Journal description: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.