Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation

Luis Mayer, Christian Heumann, Matthias Aßenmacher
Published in: arXiv - CS - Software Engineering, 2024-09-06. DOI: arxiv-2409.04164 (https://doi.org/arxiv-2409.04164)

Abstract

In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. In this study, we evaluate five state-of-the-art LLMs (Bard, BingChat, ChatGPT, Llama2, and Code Llama) on their text-to-code generation capabilities. In an empirical study, we prompt the models with textual descriptions of coding problems sourced from the programming website LeetCode and task them with creating solutions in Python. We then assess the quality of the generated outputs using LeetCode's testing functionality. The results indicate large differences in performance between the investigated models. ChatGPT handles these typical programming challenges far more effectively than the other models, surpassing even code-specialized models like Code Llama. To gain further insights, we measure the runtime and memory usage of the generated outputs and compare them to the other code submissions on LeetCode. A detailed error analysis, which compares the models with respect to correct indentation and form of the generated code and assigns the incorrectly solved tasks to error categories, gives a more nuanced picture of the results and the potential for improvement. The results also show a clear pattern: the models produce incorrect code increasingly often when they face a lot of context in the form of longer prompts.
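
The kind of aggregation the abstract describes (per-model success on judged solutions, and success as a function of prompt length) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Submission` schema, the helper names, the character threshold, and the `model_a`/`model_b` records are all hypothetical placeholders, and none of the numbers come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Submission:
    """One generated solution and its judged outcome (hypothetical schema)."""
    model: str          # placeholder model label, not a result from the study
    prompt_chars: int   # length of the problem description fed to the model
    accepted: bool      # True if the solution passed all tests


def acceptance_rate(subs):
    """Fraction of accepted submissions per model."""
    totals, passed = defaultdict(int), defaultdict(int)
    for s in subs:
        totals[s.model] += 1
        passed[s.model] += s.accepted
    return {m: passed[m] / totals[m] for m in totals}


def rate_by_prompt_length(subs, threshold):
    """Acceptance rate for short (< threshold chars) vs. long prompts."""
    buckets = {"short": [], "long": []}
    for s in subs:
        key = "short" if s.prompt_chars < threshold else "long"
        buckets[key].append(s.accepted)
    return {k: sum(v) / len(v) for k, v in buckets.items() if v}


# Invented placeholder records, NOT data from the paper.
demo = [
    Submission("model_a", 400, True),
    Submission("model_a", 1800, False),
    Submission("model_b", 400, True),
    Submission("model_b", 1800, True),
]

print(acceptance_rate(demo))
print(rate_by_prompt_length(demo, threshold=1000))
```

A single length threshold is only the simplest possible split; a real analysis would more likely bin prompts into several ranges or fit a trend over prompt length.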